NDC Oslo 2018 - A Practical Guide to Graph Databases

A Practical Guide to Graph
Databases
About Me
Architect and Full Stack Developer
● 20 years of full stack experience
● Distributed, high-performance, low-latency
big data platforms
● Graph Databases are kinda my thing
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
Graph Databases
Graph Databases are hot
Graph Theory
What is a Graph Datastore?
● Type of NoSQL datastore
● Uses graph structures (nodes, edges)
to store data
● Efficiently represents and traverses
relationships
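The bullets above can be made concrete with a small sketch. This is a toy in-memory structure, not how any particular graph database stores data; the class name `Graph`, its methods, and the sample people are all assumptions made for illustration.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory labeled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}                     # node id -> properties
        self.out_edges = defaultdict(list)  # node id -> [(label, target id, properties)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        self.out_edges[source].append((label, target, props))

    def neighbors(self, node_id, label):
        """Follow outgoing edges that carry the given label."""
        return [t for (l, t, _) in self.out_edges[node_id] if l == label]


# Hypothetical sample data
g = Graph()
g.add_node("alice", name="Alice")
g.add_node("bob", name="Bob")
g.add_edge("alice", "friend", "bob", since=2015)

friends_of_alice = g.neighbors("alice", "friend")
```

Following a relationship here is a direct lookup from a node to its edges, which is the property the last bullet refers to.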
The NoSQL Spectrum
Why use a graph database?
Network Analysis
Master Data Management
Recommendation Engines
Fraud Detection
Graph Ecosystem
The ecosystem is large and growing
The ecosystem is complex
Frameworks
RDF Triple Stores Labeled Property Model
Databases
Databases vs. Frameworks
Frameworks
● Data is processed not persisted
● Works on enormous datasets
● OLAP workloads
Databases
● Data is persisted and processed
● Real time querying
● OLTP and OLAP workloads
RDF/Triple Stores vs. Labeled Property Graphs
RDF Triple Stores
● Each entity is a triple
● Works with subject - predicate -
object
● Comes from semantic web
● Great for inferring relationships
Labeled Property Graphs
● Entities are a node or an edge
● Works with nodes - edges -
properties - labels
● Both nodes and edges contain
properties
● Great for efficiently traversing
relationships
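As a rough illustration of the two models, the same small dataset is written below first as triples and then as a labeled property graph. The data, the key names (`label`, `from`, `to`, `since`) and the helper functions are assumptions for this sketch, not the storage format of any real product.

```python
# RDF style: every fact is one (subject, predicate, object) triple.
triples = [
    ("alice", "type", "Person"),
    ("alice", "name", "Alice"),
    ("bob", "type", "Person"),
    ("bob", "name", "Bob"),
    ("alice", "knows", "bob"),
]

# Labeled property graph style: nodes and edges each carry a label and properties.
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "bob": {"label": "Person", "name": "Bob"},
}
edges = [
    {"from": "alice", "to": "bob", "label": "knows", "since": 2015},
]


def objects(subject, predicate):
    """Triple lookup: all objects for a given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def out_neighbors(node_id, label):
    """Property graph lookup: targets of outgoing edges with the given label."""
    return [e["to"] for e in edges if e["from"] == node_id and e["label"] == label]


knows_rdf = objects("alice", "knows")
knows_lpg = out_neighbors("alice", "knows")
```

Note that the `since` property sits directly on the edge in the property graph version. The plain triple list has no slot for it, so expressing it would take extra triples describing the relationship itself.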
RDF/Triple Stores vs. Labeled Property Graphs
RDF Triple Stores Labeled Property Graphs
Graph Query Languages
Gremlin
● Imperative +
Declarative
● Powerful
● Steep Learning
Curve
GraphQL
● Most useful for
REST endpoints
● Query Language
for APIs
SPARQL
● W3C Standard
for RDF
● Based on the
Semantic Web
Cypher
● Declarative
● Easy to Use
● Most Popular
Language
Others
● Most are
extensions of SQL
● Usually specific to
one system
Queries - Find a Friend of a Friend
SPARQL
PREFIX foaf:
<http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
?x foaf:knows ?y .
?y foaf:knows ?z .
?z foaf:name ?name .}
Cypher
MATCH (me:Person)-[:FRIEND*2]->
(myFriend:Person) RETURN myFriend.name
Gremlin
g.V().hasLabel('person')
.repeat(out('friend')).times(2)
.dedup().values('name').next()
GraphQL
{
friend {
friend {
name
}
}
}
SQL Variants
SELECT name FROM expand(
bothE('is_friend_with').bothV()
.bothE('is_friend_with').bothV()
)
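For readers who do not know any of these languages yet, the friend-of-a-friend traversal can be sketched in plain Python. The adjacency dictionary and the function name `friends_of_friends` are made up for this example; it mirrors the "two hops, then deduplicate" shape of the Gremlin query above rather than any engine's actual execution.

```python
# Hypothetical directed "friend" edges: person -> list of friends
friends = {
    "me": ["ann", "ben"],
    "ann": ["cat", "ben"],
    "ben": ["cat", "dan"],
    "cat": [],
    "dan": [],
}


def friends_of_friends(graph, start):
    """People exactly two 'friend' hops from start, deduplicated, in discovery order."""
    seen = []
    for friend in graph.get(start, []):       # first hop
        for fof in graph.get(friend, []):     # second hop
            if fof not in seen:               # dedup
                seen.append(fof)
    return seen


result = friends_of_friends(friends, "me")
```

As with the queries above, someone who is both a direct friend and a friend of a friend (here `ben`) still appears in the result. Excluding direct friends would need an extra filter.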
Both
Visualization
Desktop Tool Web
Visualizations
To use or not to use,
that is the question
Everything is a
Graph
But that doesn’t mean you should solve it with a graph
Explore the
Questions
Search and Selection
● Get me everyone who works at X?
● Find me everyone with a first name like “John”?
● Find me all stores within X miles?
Answer: Use an RDBMS or a Search Server
Related Data
● What is the easiest way for me to be introduced to an executive at X?
● How do “John” and “Paula” know each other?
● How is company X related to company Y?
Answer: Use a Graph
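A question like "How do John and Paula know each other?" is a shortest-path problem. A minimal sketch using breadth-first search follows; the `knows` data and the function name `shortest_path` are assumptions for illustration, and a graph database would normally provide this as a built-in query feature.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical "knows" relationships, stored in both directions
knows = {
    "John": ["Maria", "Omar"],
    "Maria": ["John", "Paula"],
    "Omar": ["John", "Lee"],
    "Lee": ["Omar", "Paula"],
    "Paula": ["Maria", "Lee"],
}

path = shortest_path(knows, "John", "Paula")
```

The search only touches the part of the graph reachable before the goal is found, which is why this kind of question suits a traversal-oriented store.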
Aggregation
● How many companies are in my system?
● What are my average sales for each day over the past month?
● What is the number of transactions processed by my system each day?
Answer: Use an RDBMS
Pattern Matching
● Who in my system has a similar profile to me?
● Does this transaction look like other known fraudulent transactions?
● Is the user “J. Smith” the same as “Johan S.”?
Answer: It depends, you might use a search server or a graph
Clustering, Centrality, and Influence
● Who is the most influential person I am connected with on LinkedIn?
● What equipment in my network will have the largest impact if it breaks?
● What parts tend to fail at the same time?
Answer: Use a graph
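The simplest centrality measure is degree: count how many connections each node has. The sketch below uses a made-up equipment network and a hypothetical helper `degree_centrality`. Real influence questions usually call for richer algorithms (PageRank, betweenness), which this does not attempt.

```python
from collections import Counter

# Hypothetical equipment network as undirected connections
connections = [
    ("router-a", "switch-1"),
    ("router-a", "switch-2"),
    ("router-a", "switch-3"),
    ("switch-1", "server-x"),
    ("switch-2", "server-y"),
]


def degree_centrality(edge_list):
    """Count how many edges touch each node (undirected)."""
    degree = Counter()
    for a, b in edge_list:
        degree[a] += 1
        degree[b] += 1
    return degree


degree = degree_centrality(connections)
most_connected, links = degree.most_common(1)[0]
```

Here `router-a` comes out on top, which is a first hint at which piece of equipment would have the largest impact if it failed.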
Still not sure?
Should I use Graph?
I sold this to Management as a Graph
project so we are using a graph
Based on work by Dr. Denise Gosnell: https://bit.ly/2s0qBC2
I’m still confused
● Do we care about the relationships between entities as
much or more than the entities themselves?
● If I were to model this in an RDBMS would I be writing
queries with multiple (5+) joins or recursive CTEs to
retrieve my data?
● Is the structure of my data continuously evolving?
● Is my domain a natural fit for a graph?
Can’t I just do this in
SQL?
Northwind Data Models
Give me all products in a category (Search/Selection)
SQL
SELECT c.categoryName, p.productName
FROM product AS p
INNER JOIN category AS c ON
c.categoryId=p.categoryId
WHERE c.categoryName='Beverages'
Gremlin
g.V().has('category', 'categoryName',
'Beverages').as('c').in('part_of')
.as('p').select('c', 'p')
.by('categoryName').by('productName')
Cypher
MATCH (p:Product)-[:PART_OF]->(c:Category {categoryName: 'Beverages'})
RETURN c.categoryName, p.productName
Give me the top 5 products ordered (Aggregation)
SQL
SELECT TOP(5) c.categoryName,
p.productName, count(*) AS orderCount
FROM [order] AS o
INNER JOIN product AS p ON
p.productId=o.productId
INNER JOIN category AS c ON
c.categoryId=p.categoryId
GROUP BY c.categoryName, p.productName
ORDER BY orderCount DESC
Gremlin
g.V().hasLabel('order').out('orders')
.groupCount().by(project('c', 'p')
.by(out('part_of').values('categoryName'))
.by('productName'))
.order(local).by(values, decr).limit(local, 5)
Cypher
MATCH (o:Order)-[:ORDERS]->(p:Product) -
[:PART_OF]->(c:Category)
RETURN c.categoryName, p.productName,
count(o)
ORDER BY count(o)
DESC LIMIT 5
Find Products Purchased by others that I haven’t purchased
(Related Data/Pattern Matching)
SQL
SELECT TOP(5) product.product_name as Recommendation,
count(1) as Frequency
FROM product, customer_product_mapping,
(SELECT cpm3.product_id, cpm3.customer_id
FROM Customer_product_mapping cpm,
Customer_product_mapping cpm2, Customer_product_mapping cpm3
WHERE cpm.customer_id = '123'
and cpm.product_id = cpm2.product_id
and cpm2.customer_id != '123'
and cpm3.customer_id = cpm2.customer_id
and cpm3.product_id not in (select distinct product_id
FROM Customer_product_mapping cpm
WHERE cpm.customer_id = '123')
) recommended_products
WHERE customer_product_mapping.product_id = product.product_id
and customer_product_mapping.product_id =
recommended_products.product_id
and customer_product_mapping.customer_id =
recommended_products.customer_id
GROUP BY product.product_name
ORDER BY Frequency desc
Gremlin
g.V().has("customer", "customerId", "123").as("c").
out("ordered").out("contains").out("is").aggregate("p").
in("is").in("contains").in("ordered").where(neq("c")).
out("ordered").out("contains").out("is").where(without("p")).
groupCount().order(local).by(values,
decr).select(keys).limit(local, 5).
unfold().values("name")
Cypher
MATCH (u:Customer {customer_id:'123'})-[:BOUGHT]->(p:Product)<-
[:BOUGHT]-(peer:Customer)-[:BOUGHT]->(r:Product)
WHERE not (u)-[:BOUGHT]->(r)
RETURN r as Recommendation, count(*) as Frequency
ORDER BY Frequency DESC LIMIT 5;
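The traversal behind all three queries is: start at my purchases, hop to other customers who bought the same products, then hop to what they bought that I did not. A minimal Python sketch of that idea follows, with made-up purchase data and a hypothetical `recommend` function. One deliberate simplification: it counts each peer once per recommended product, whereas the Cypher query above counts once per matching path.

```python
from collections import Counter

# Hypothetical purchases: customer id -> set of products bought
bought = {
    "123": {"tea", "coffee"},
    "456": {"tea", "milk", "sugar"},
    "789": {"coffee", "milk"},
    "999": {"bread"},
}


def recommend(purchases, customer_id, limit=5):
    """Products bought by peers (customers sharing a purchase) that I have not bought."""
    mine = purchases[customer_id]
    frequency = Counter()
    for peer, products in purchases.items():
        if peer == customer_id or not (products & mine):
            continue  # skip myself and customers with no product in common
        for product in products - mine:
            frequency[product] += 1
    return frequency.most_common(limit)


recommendations = recommend(bought, "123")
```

Customer `999` shares no purchase with `123`, so `bread` is never recommended.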
Give me all employees, their supervisor and level (Recursive CTE)
SQL
WITH EmployeeHierarchy (EmployeeID,
LastName,
FirstName,
ReportsTo,
HierarchyLevel) AS
( SELECT EmployeeID
, LastName
, FirstName
, ReportsTo
, 1 as HierarchyLevel
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
SELECT e.EmployeeID
, e.LastName
, e.FirstName
, e.ReportsTo
, eh.HierarchyLevel + 1 AS HierarchyLevel
FROM Employees e
INNER JOIN EmployeeHierarchy eh
ON e.ReportsTo = eh.EmployeeID)
SELECT *
FROM EmployeeHierarchy
ORDER BY HierarchyLevel, LastName, FirstName
Gremlin
g.V().hasLabel("employee").where(__.not(out("reportsTo"))).
repeat(__.in("reportsTo")).emit().tree().by(map
{def employee = it.get(); employee.value("firstName") + " " +
employee.value("lastName")}).next()
Cypher
MATCH p = (u:Employee)-[:ReportsTo*0..]->(top:Employee)
WHERE NOT (top)-[:ReportsTo]->()
OPTIONAL MATCH (u)-[:ReportsTo]->(s:Employee)
RETURN u.firstName AS FirstName, u.lastName AS LastName,
(s.firstName + " " + s.lastName) AS ReportsTo, length(p) + 1 AS
HierarchyLevel ORDER BY HierarchyLevel, LastName, FirstName
Based on work by http://sql2gremlin.com/
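What the recursive CTE computes can be sketched in a few lines of Python: walk each employee's `ReportsTo` chain up to the top and count the steps. The `reports_to` dictionary and the function name `hierarchy_levels` are assumptions for illustration, and the sketch assumes the reporting structure has no cycles.

```python
# Hypothetical reporting structure: employee -> supervisor (None for the top)
reports_to = {
    "Fuller": None,
    "Davolio": "Fuller",
    "Buchanan": "Fuller",
    "Suyama": "Buchanan",
}


def hierarchy_levels(reports_to):
    """Return (level, employee, supervisor) rows sorted by level then name; top level is 1."""
    def level(employee):
        boss = reports_to[employee]
        return 1 if boss is None else level(boss) + 1

    rows = [(level(e), e, reports_to[e]) for e in reports_to]
    # Sort on level and name only, so a None supervisor is never compared
    return sorted(rows, key=lambda row: (row[0], row[1]))


rows = hierarchy_levels(reports_to)
```

The level of an employee is simply the length of the path to the root, which is why the graph versions of this query are so much shorter than the SQL.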
Where do I start?
Choosing a Datastore
● Framework vs. RDF vs. Property Model
● HA/Transaction Volume/Data Size
● Hosted vs On Premise
Datastore Concerns
● Data Consistency - ACID or BASE
● Explore your choices
● Beware the Operational Overhead
Data Modelling
● Whiteboard friendly - stay close to the conceptual model, but be pragmatic
● Take into account how you are traversing data
● Use your Relational model to start
● Iterate, Iterate, Iterate
Data Modelling Concerns
● Don’t use Symmetric Relationships
● Look out for Hidden/Anemic Relationships
● Look for Supernodes
● Schema - Use it and make it general
What next?
Summary
The Good
● Graphs are flexible
● Great at finding and traversing relationships
● Natural fit in many complex domains
● Query times are proportional to amount of graph you traverse
The Bad
● Different options scale very differently
● Team needs to learn a new mindset
● Still immature space
The Ugly
● Lack of documentation
● Large, splintered and rapidly evolving ecosystem
● Hard for new users to tell good versus bad use cases
Advice from the trenches...
● Graph datastores may solve your problem, but understand your problem first
● Expect some trial and error
● Your data model will evolve, plan for it
● Don’t underestimate the time it takes to bring your team up to speed
● Graph databases are not a silver bullet
www.bechbergerconsulting.com
www.bechberger.com
@bechbd
www.linkedin.com/in/davebechberger
Questions?

Editor's Notes

  1. Test text for sizing
  2. Not an architect that just draws boxes and lines, I get my hands dirty by actually helping to build these things
  3. Graph database popularity is up almost 800% since January of 2013
  4. Leonhard Euler, 1735: the Seven Bridges of Königsberg. Two islands in the Pregel River, connected by seven bridges. Can you walk across every bridge and return to the start without crossing any bridge twice? A knowledge of graph theory may help but is not required.
  5. Lots of examples out there as to why use a graph database but these are just a few
  6. The ecosystem is large and growing. This slide currently shows 43 datastores. I originally put this out on Twitter and immediately had about 10 more additions of datastores I had never heard of.
  7. Lots of options out there. SPARQL is a standard for RDF graphs; there is no equivalent standard for property model graphs. There is a movement called GQL that is attempting to create a standard property model graph language.
  8. There are lots of tools to help you visualize your data. Don’t fall into the trap of thinking that the only way to view your data is as a node chart.
  9. There are lots of tools to help you visualize your data. Don’t fall into the trap of thinking that the only way to view your data is as a node chart.
  10. Graphs are flexible. In general it is easy to extend your model with additional attributes and objects, allowing data to evolve at a rapid pace. Graphs are great for searching relationships between items, but make sure that's what you want to search. Graphs are a more natural data model in many domains. Graph processing times are proportional to the number of nodes and edges you choose to traverse, not the data size.
  11. Graph datastores scale differently in terms of transactions and data size; many are single server only. It is a different mindset your team has to learn, and learning is not a cheap process. Graph databases are still not as mature as RDBMS systems.
  12. There is a lot of documentation for neophyte and expert users, but not much in between. The ecosystem is vast, splintered and constantly evolving. Graph databases are great for some use cases and horrible for others, and it's not always easy to tell which you are in.