Graph Analytics
For Fun and Profit
Hello!
I am David Bechberger
Sr. Architect for Data and Analytics at Gene by
Gene, a bioinformatics company specializing
in genetic genealogy.
You can find me at:
@bechbd
www.linkedin.com/in/davebechberger
What we do at
Swab → Sequence → Analysis → Insight
What this talk isn’t
◎A thorough review of graph analytic
techniques
◎A review of all graph analytic frameworks
◎A deep dive into any of the techniques we
discuss
What this talk is
◎Where to start with Graph Analytics
◎OLTP and OLAP in Gremlin
◎Practical Examples using …..
Family
Trees
◎We all have them
◎I know them well
◎They are natural
graphs
Or more specifically this
[Schema diagram: vertex labels tree, individual, family, name; edge labels owns, member_of, is_known_as, is_spouse, is_first_cousin]
Example - Find the names of all family members in a tree
[Diagram: tree T1 owns individual I1. Individuals I1-I4 are linked to families F1 and F2 by member_of edges that carry a role (Husband, Wife, or Son), and to the names Bob, Steve, Joan, and Rick by is_known_as edges.]
Gremlin Example - Finding the names of all family members
for tree owner
g.V().has('tree', 'unique_id', 'T1')
  .out('owns')
  .sideEffect(
    out('is_known_as').properties('full_name')
      .store('name')
  )
  .out('member_of').in('member_of')
  .sideEffect(
    out('is_known_as').properties('full_name')
      .store('name')
  )
  .cap('name')
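For readers without a Gremlin server at hand, the same walk (tree to owner, owner to families, families back to members, collecting names along the way) can be sketched in plain Python. The dictionaries below are hypothetical stand-ins for the graph; in particular, which individual belongs to which family is an assumption made for illustration.

```python
# Hypothetical toy data standing in for the graph (assignments are assumed).
OWNS = {"T1": ["I1"]}  # tree -> owning individual
MEMBER_OF = {"I1": ["F1", "F2"], "I2": ["F1"], "I3": ["F2"], "I4": ["F2"]}
FULL_NAME = {"I1": "Bob", "I2": "Steve", "I3": "Joan", "I4": "Rick"}


def family_member_names(tree_id):
    """Collect names the way the two sideEffect/store steps do."""
    names = []
    for owner in OWNS.get(tree_id, []):
        names.append(FULL_NAME[owner])  # first store('name')
        for family in MEMBER_OF.get(owner, []):  # out('member_of')
            for person, families in MEMBER_OF.items():  # in('member_of')
                if family in families:
                    names.append(FULL_NAME[person])  # second store('name')
    return names


if __name__ == "__main__":
    print(family_member_names("T1"))
```

Like store(), the list keeps duplicates: the owner is reached again through every family they belong to, so wrap the result in set() if only distinct names are wanted.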
Apache TinkerPop Gremlin OLTP and OLAP
◎TinkerPop supports both
◎Gremlin can be used to
query in either
◎But there are differences….
Gremlin OLTP versus OLAP
OLTP
◎ Depth First
◎ Lazy Evaluation - Low memory usage
◎ Real-time (ms/sub-sec)
OLAP
◎ Breadth First
◎ Eager Evaluation - High memory usage
◎ Long Running (min/hour)
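The lazy depth-first versus eager breadth-first contrast can be illustrated outside of Gremlin. The sketch below is plain Python over a hypothetical five-vertex graph and is not how TinkerPop is implemented: the generator hands back one vertex at a time, while the breadth-first version materialises each whole frontier before moving on.

```python
# Hypothetical toy graph: vertex -> outgoing neighbours (acyclic on purpose).
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}


def depth_first_lazy(start):
    """Yield vertices one at a time; nothing is computed until requested."""
    stack = [start]
    while stack:
        vertex = stack.pop()
        yield vertex
        # reversed() so neighbours are visited in their listed order
        stack.extend(reversed(GRAPH[vertex]))


def breadth_first_eager(start):
    """Build every complete frontier (level) in memory before continuing."""
    levels, frontier = [], [start]
    while frontier:
        levels.append(frontier)
        frontier = [n for v in frontier for n in GRAPH[v]]
    return levels


if __name__ == "__main__":
    print(next(depth_first_lazy("a")))  # only the first result is computed
    print(breadth_first_eager("a"))     # all levels are computed up front
```

Asking the generator for a single result touches a single vertex, which is the low-memory, real-time behaviour listed under OLTP; the level-by-level version holds a full frontier at once, which mirrors the higher memory use listed under OLAP.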
Limitations
OLTP
◎ Cannot run certain queries or steps (e.g. pageRank, bulk loading)
◎ Limited time per query
◎ Local operations
OLAP
◎ Some steps are prohibitive, like path(), simplePath(), etc.
◎ Barrier steps (count(), min(), max(), etc.)
◎ Global operations
What insights are we going to gain
◎Who in this tree is the most important?
◎Who in this tree is 6 degrees from Kevin
Bacon?
◎Who in this tree married their first cousin?
1.
Centrality Analysis
Finding Importance
Degree
Centrality
Count the edges
Example - Who is a member of the most families?
g.V().hasLabel('individual')
.project('person', 'degree')
.by('full_name')
.by(bothE('member_of').count())
.order().by(select('degree'), decr).limit(5)
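As a rough cross-check of what the query computes, degree centrality is just an edge count per vertex. The following minimal Python sketch assumes a hypothetical list of (individual, family) member_of edges; the names and family ids are invented for illustration.

```python
from collections import Counter

# Hypothetical member_of edges as (individual, family) pairs.
MEMBER_OF_EDGES = [
    ("Henry", "F1"), ("Henry", "F2"), ("Henry", "F3"),
    ("Anne", "F2"),
    ("Jane", "F3"), ("Jane", "F4"),
]


def top_by_degree(edges, limit=5):
    """Count member_of edges per individual and return the largest counts."""
    degree = Counter(person for person, _family in edges)
    return degree.most_common(limit)


if __name__ == "__main__":
    for person, degree in top_by_degree(MEMBER_OF_EDGES):
        print(person, degree)
```

Counter.most_common plays the role of order().by(select('degree'), decr).limit(5) in the Gremlin version.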
Eigenvector
Centrality
Relative importance matters
[Diagram: example graph whose vertices are labeled with the scores .6, .3, .5, .4, .2, .2, .2]
Example - Who is the most important individual?
g.V().hasLabel('individual')
.repeat(
groupCount('m').by('full_name')
.out('member_of').in('member_of')
.timeLimit(100)
).times(5).cap('m')
.order(local).by(values, decr)
.limit(local, 5).next()
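The repeated hop-and-count in the query above approximates eigenvector centrality by iteration. A common textbook way to compute it is power iteration, sketched here in plain Python. The person-to-person adjacency map is hypothetical (it stands in for the individual to family to individual hops), and the iteration count is an arbitrary choice.

```python
# Hypothetical undirected person-to-person graph (a triangle plus one leaf).
ADJACENCY = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}


def eigenvector_centrality(adjacency, iterations=50):
    """Power iteration: a vertex's score is the sum of its neighbours' scores."""
    scores = {v: 1.0 for v in adjacency}
    for _ in range(iterations):
        nxt = {v: sum(scores[n] for n in adjacency[v]) for v in adjacency}
        norm = max(nxt.values()) or 1.0  # rescale so the top score is 1.0
        scores = {v: s / norm for v, s in nxt.items()}
    return scores


if __name__ == "__main__":
    ranked = sorted(eigenvector_centrality(ADJACENCY).items(),
                    key=lambda item: item[1], reverse=True)
    for vertex, score in ranked:
        print(vertex, round(score, 3))
```

One caveat of this simple form: on a bipartite graph plain power iteration can oscillate instead of converging, which is one reason to iterate over the person-to-person projection rather than the raw individual/family edges.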
PageRank
Similar to the Eigenvector
Centrality but with scaling
[Diagram: example graph whose vertices are labeled with the scores 25, 3, 2, 5, 1, 3, 2, 22]
Example - Whose lineage exerts the most influence over this
family tree?
g.withComputer().V().hasLabel('individual')
.pageRank()
.by(bothE('member_of')).by('rank')
.order().by('rank', decr)
.valueMap('full_name', 'rank').limit(5)
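To make the "eigenvector centrality with scaling" idea concrete, here is a minimal PageRank sketch in plain Python. It is the textbook iteration, not TinkerPop's implementation; the link map, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
# Hypothetical directed graph: vertex -> vertices it links to.
LINKS = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}


def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank: each vertex splits its rank among its out-links."""
    n = len(links)
    rank = {v: 1.0 / n for v in links}
    for _ in range(iterations):
        # every vertex receives the same small baseline (the scaling factor)
        nxt = {v: (1.0 - damping) / n for v in links}
        for v, targets in links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # dangling vertex: spread its rank evenly over all vertices
                for t in links:
                    nxt[t] += damping * rank[v] / n
        rank = nxt
    return rank


if __name__ == "__main__":
    for vertex, score in sorted(pagerank(LINKS).items(),
                                key=lambda item: item[1], reverse=True):
        print(vertex, round(score, 4))
```

The baseline term guarantees every vertex a nonzero score, which is the main practical difference from the plain eigenvector iteration.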
Answer

Degree
Name                   Value
Henry VIII             7
Charlemagne            6
Jan                    5
Ferdinand VII          5
Philip II              5

EigenVector
Name                   Value
Mary                   149950
Margret                124221
Henry VIII             107539
Son                    90715
Daughter               86961

PageRank
Name                   Value
Joan of the Tower      0.784
Edward III             0.774
Elenor                 0.774
John of Eltham         0.719
Frederick William III  0.681
And many
more...
Closeness Centrality
Betweenness Centrality
Katz Centrality
Freeman Centrality …...
Practical Examples
◎Who is the most important person in my
family's history?
◎Who in my family history has been the most
prolific?
2.
Path Analysis
Who in this tree is 6 degrees from
Kevin Bacon?
Path
How did you get there?
Simple
Path
Don’t Repeat yourself
Cyclic
Path
Ok then Repeat yourself
Sorry
Not in this family tree
How about this instead?
Example - How long is the lineage between Queen Victoria
and Henry VIII?
SimplePath
g.V('@I1@').repeat(timeLimit(60000)
.out('member_of').in('member_of')
.simplePath()).until(hasId('@I828@'))
.path().limit(1).count(local)
CyclicPath
g.V('@I1@').repeat(timeLimit(60000)
.out('member_of').in('member_of')
.cyclicPath()).until(hasId('@I828@'))
.path().limit(1).count(local)
Answer
SimplePath: 25 steps
CyclicPath: 27 steps
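The difference between the two filters comes down to whether a path revisits a vertex. The plain-Python sketch below uses a hypothetical four-vertex graph: it checks paths for repeats and runs a breadth-first search that, like simplePath(), drops any extension that would revisit a vertex.

```python
from collections import deque

# Hypothetical directed toy graph.
ADJACENCY = {"a": ["b", "c"], "b": ["a", "c"], "c": ["d"], "d": []}


def is_simple(path):
    """A simple path never visits the same vertex twice."""
    return len(set(path)) == len(path)


def is_cyclic(path):
    """A cyclic path repeats at least one vertex."""
    return not is_simple(path)


def shortest_simple_path(adjacency, start, goal):
    """Breadth-first search that only extends paths that stay simple."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency[path[-1]]:
            if nxt not in path:  # the simplePath()-style filter
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(shortest_simple_path(ADJACENCY, "a", "d"))
    print(is_cyclic(["a", "b", "a", "c", "d"]))
```

Because the search is breadth first, the first path that reaches the goal is also a shortest one in terms of hops.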
Practical Examples
◎How am I related to X in my family?
◎Does this family tree contain clusters of
people?
3.
Pattern Detection
Finding what is hidden
Pattern Detection in Gremlin
◎Gremlin has the ability to be imperative
○ g.V().in().out()......
◎Or declarative
○ g.V().match(
    __.as('a').....as('b'), //predicate 1
    __.as('b').....as('c'), //predicate 2
    __.as('c').where('c', eq('b')).as('c')
  ).select('b', 'c')
Example - Who is married to their first cousin?
g.V().match(
__.as('e').has('individual','sex','M').as('husband'),
__.as('husband').in('is_spouse').as('wifes'),
__.as('husband').both('is_first_cousin').as('cousin'),
__.as('cousin').where('cousin',eq('wifes')).as('wife')
).select('husband','wife')
.by('full_name').fold().unfold()
Answer
   Husband                  Wife
1  Albert Augustus Charles  Victoria /Hanover/
2  Leopold_I                Margaret Teresa
3  Alexander_I the_Fierce   Sybil
4  Philip_IV                Mariana of_Austria
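The four match() predicates amount to intersecting two sets per husband: his spouses and his first cousins. A small Python sketch of that logic follows; the ids and edges are hypothetical and only mirror the edge directions used in the query (is_spouse pointing at the husband, is_first_cousin read in both directions).

```python
# Hypothetical data; ids and edges are invented for illustration.
SEX = {"h1": "M", "w1": "F", "h2": "M", "w2": "F", "x": "F"}
IS_SPOUSE = {("w1", "h1"), ("w2", "h2")}         # (wife, husband)
IS_FIRST_COUSIN = {("h1", "w1"), ("x", "h2")}    # undirected in meaning


def married_first_cousins():
    """Return (husband, wife) pairs where the wife is also a first cousin."""
    matches = []
    for husband in sorted(p for p, sex in SEX.items() if sex == "M"):
        # in('is_spouse'): everyone with a spouse edge pointing at him
        wives = {w for (w, h) in IS_SPOUSE if h == husband}
        # both('is_first_cousin'): follow the edge in either direction
        cousins = ({b for (a, b) in IS_FIRST_COUSIN if a == husband}
                   | {a for (a, b) in IS_FIRST_COUSIN if b == husband})
        # where('cousin', eq('wifes')): keep people in both sets
        for wife in sorted(wives & cousins):
            matches.append((husband, wife))
    return matches


if __name__ == "__main__":
    print(married_first_cousins())
```

Written this way the order of the checks is fixed by the programmer, whereas match() leaves the evaluation order to the engine.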
Practical Examples
◎Merging trees together based on potential
common ancestors using pattern matching
4.
Putting it all together
Example - Which women who married their first cousin had
the greatest number of families?
g.V().match(
__.as('e').has('individual','sex','M').as('husband'),
__.as('husband').in('is_spouse').as('wifes'),
__.as('husband').both('is_first_cousin').as('cousin'),
__.as('cousin').where('cousin',eq('wifes')).as('wife')
).select('wife')
.project('person','degree')
.by('full_name')
.by(bothE('member_of').count())
.order().by(select('degree'), decr).limit(5)
Answer
   Wife                Degree
1  Victoria /Hanover/  2
2  Margaret Teresa     3
3  Sybil               4
4  Mariana of_Austria  2
Thanks!
Any questions?
You can find me at:
dave@bechberger.com
@bechbd
www.linkedin.com/in/davebechberger


Editor's Notes

  1. Background: nearly 20 years of full-stack development in .NET, C, Java/Scala, and pretty much everything else. Switched to working almost exclusively on big data problems several years ago. Spent the last few years leveraging graph databases to build out high-performance data platforms. If you have questions on using .NET and graph databases, feel free to come talk to me. Current role is Sr. Architect for Data and Analytics, building out our next-generation data and analytics platform.
  2. I like to think of this talk as “Things I wish I knew 18 months ago about graphs.”
  3. Family trees are a well-known model. We are going to use a European royal family tree.
  4. Based on GEDCOM, a 1995 standard by the LDS Church. Basically it's a linked data structure where all records are atomic units (individual/family/name/note) that contain pointers to each other.
  5. Start at a tree. Move to the root owner and to their name. Traverse out to families. Then move from families to other individuals and their names.
  6. Here is what an example query on our model looks like. As you can see, the basis of this model, as it was brought over from GEDCOM, can make queries more verbose than one would normally like for retrieving what should be a relatively simple set of data.
  7. OLTP: depth first. Serial stream processing provides depth-first traversals into the data. It can be thought of as a stream processor where graph traversers arrive from the left, an instruction is processed on those traversers, and the mutated traversers are sent out the right. OLAP: unlike OLTP queries, OLAP queries are breadth first, meaning that they run in a logically parallel fashion and use message passing to communicate between vertices.
  8. OLTP has its limitations, most notably that certain complex operations (such as running pageRank, bulk loading, and global operations) are not allowed or not appropriate for a transactional workload. OLAP uses a scatter/gather methodology that allows for working on massive scales of data, but it also prevents some steps (such as path() and simplePath()) from being executed and keeps others, such as order(), from being meaningful. It also has the disadvantage that some steps within a Gremlin query require all of the data to be in the same location to process. Steps such as count(), min(), max(), and group() are known as barrier steps and require that all the data return to a single location to be processed before being sent back out to workers. Use OLTP when your query is going to touch only a portion of the data or a subgraph, e.g. “Give me the average age of people in my family.” Use OLAP when your query is going to touch all or a significant amount of the data in the graph, e.g. “Give me the average age of everyone in my family tree.”
  9. Centrality analysis is about determining what is most important in your graph. This sort of analysis is quite common when performing social network analysis, looking for key infrastructure points, and examining biological networks. Unfortunately, defining what it means to be important really depends on the circumstances. One other important thing to remember is that these sorts of algorithms measure the importance of a vertex in a graph, which may or may not be correlated with its influence. For finding the most influential nodes in a graph there are other node influence metrics you would want to investigate.
  10. Degree centrality: a measure of the number of edges associated with a vertex. Degree centrality looks at the number of connections a vertex has and uses that to determine its relative importance. This can be further refined by counting only inward or only outward edges. In degree centrality, the larger the number, the more influential the vertex.
  11. Eigenvector centrality: a measure of a vertex's importance that uses the relative importance of the adjacent vertices. That is, if a vertex has many edges but is connected to few influential vertices, it will be ranked lower than a vertex with fewer edges whose adjacent vertices are more important.
  12. PageRank: made famous by Sergey Brin and Larry Page at Google for ranking web links. It works similarly to eigenvector centrality but adds a scaling factor to the results. This algorithm is well documented but far from something you would want to create yourself. Luckily Gremlin has a prebuilt step to help us with this.
  13. The interesting part about this answer is not necessarily the answer itself but the fact that each method produced distinctly different answers. This is an example of why you need to understand your question in order to choose the correct method.
  14. Why do these examples matter?
  15. A path is the walk through the graph defined by a traversal. The Path object contains all labels (“as(xxxx)”), all vertices, all edges, and all side effects/data structures. Path traversals tend to be on the slow side and are computationally expensive, as the entire path is stored for each traverser. This can expand exponentially as the size of the
  16. Simple path queries are pretty much what they sound like: paths that never revisit a vertex. simplePath() filters out paths that contain repeats, which minimizes the amount of computation required. Simple path queries are often useful if you want to find the shortest connection between two things, such as in a transportation network, between patterns or subgraphs, or in social network analysis.
  17. Cyclic paths are paths that repeat back on themselves. Using something like a cyclic path can be a first step in trying to detect communities or clusters within your graph.
  18. How about we find the quickest lineage between Queen Victoria and Henry VIII instead?
  19. There are a few key things to note here. Where the “until” sits matters when you do a repeat: if it is before the “repeat” it is a while/do loop, and if it is after it is a do/while loop. Adding a time limit to your traversal can help prevent a never-ending query.
  20. If you go in and examine the path objects returned by these two queries, you will notice that the difference between the simple and cyclic paths is that the cyclic path circles through Queen Victoria's husband Albert before continuing on to Henry VIII.
  21. Finding how you are related to others in your family tree is a rather straightforward matter of counting the ups and downs in generations found by the simplest path. Finding clusters of people in your tree can be used to help identify areas in your tree where familial marriages were common.
  22. Gremlin has the ability to work as both an imperative language and a declarative language. In the imperative model you usually write queries as we described earlier: you start with some stream over vertices, then move left to right taking in data, processing that data, and then emitting the processed data. The declarative model uses a different approach: the user defines a base set of nodes and then describes one or more patterns that the data needs to match. Once the query is submitted, the Gremlin engine determines the optimal query to run to find that pattern within the graph. One of the neat features of Gremlin is that you are able to intermix the two types of syntax within the same traversal. Personally I find the declarative syntax powerful, but I struggle every time I work with it.
  23. What we are doing here is: 1. First we define a predicate containing all the males. 2. Next we define a predicate containing all of those men's wives. 3. We then define a predicate containing all of those men's cousins. 4. Finally we match everyone who is both a cousin and a wife.
  24. Yes, while I understand this is an interesting traversal from an inquisitive perspective, it is also relevant from a genetic perspective. Endogamous populations, ones that marry within specific groups, have a greater chance of inheriting familial genetic defects than people who marry within the larger population. While cousin marriages were common across many parts of the European royal family tree, one famous example was Queen Victoria, who married her first cousin Albert. Due to this close relationship between parents, several of Queen Victoria's children ended up with hemophilia, which is a genetic defect on the X chromosome inherited from parents.
  25. Why do these examples matter? Well in our business our customers are very interested in expanding their family trees. If we are able to use pattern matching algorithms to suggest potential matches in other people’s family trees then we are able to quickly and effectively provide them with the ability to expand their family trees.
  26. When it comes down to it, this is a bit of a strange query, but intermixing these sorts of graph analytical tools lets you gain more valuable insight into your data. This query is also an example of how you can leverage both declarative and imperative syntaxes in the same query.