Introduction to Graph Databases


Published in: Technology, Business
  • * graph db usage poll
  • * Six degrees game
    * Relational databases can't easily answer certain types of questions
  • * First pass using a relational database
    * cast table: actor_name, movie_title
    * Hard to visualize the solution
    * To do this, you need multiple passes or joins
  • * Each degree adds a join
    * Increases complexity
    * Decreases performance
    * Stop when the actor you're looking for is in the list
  • * This problem highlights the ugly truth about RDBs
    * They weren't designed to handle these types of problems
    * Arbitrary path query
    * RDB relationships join data, but are not data in themselves
    * Set math
    * Gather everything in the set that matches these criteria, then tell me if this thing is in the set
    * 1st set, no problem
    * 2nd set, no problem
    * 3rd set not related to 1st
    * 4th not related to 2nd
    * 5th related to 1st and 4th
    * etc.
    * Relationships are only available between overlapping sets
  • * disjoint sets
  • * Graphs
    * Not X-Y plots
    * Computer Science definition of graphs
    * A graph is an ordered pair G = (V, E), where V is a set of vertices and E is a set of edges, which are pairs of vertices
    * Node: vertex
    * Relationship: edge
    * Property: meta-datum attached to a node or relationship
    * Nodes can have arbitrary properties
    * Relationships are first-class citizens
    ** Have a type
    ** Have properties
    ** Have a direction
    ** Domain semantics
    ** Traversable in any direction
    * This is how graph DBs solve the problems that RDBs can't
    * Path: an ordered list of nodes and relationships
    * Paths are found using traversal algorithms
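The vocabulary in these notes (nodes, typed and directed relationships, properties on both) can be sketched as a small in-memory structure. This is only an illustration of the model, not how any particular graph database stores data; the names `Node`, `Relationship`, `Graph`, `relate` and `neighbors`, and the `role` property, are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex carrying arbitrary properties."""
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """An edge that is data in itself: it has a type, a direction and properties."""
    start: Node
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.nodes = []
        self.relationships = []

    def add_node(self, **properties):
        node = Node(properties)
        self.nodes.append(node)
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def neighbors(self, node, rel_type=None, direction="both"):
        """Follow relationships from `node`; direction is 'out', 'in' or 'both'."""
        found = []
        for rel in self.relationships:
            if rel_type is not None and rel.type != rel_type:
                continue
            if direction in ("out", "both") and rel.start is node:
                found.append(rel.end)
            if direction in ("in", "both") and rel.end is node:
                found.append(rel.start)
        return found


graph = Graph()
keanu = graph.add_node(name="Keanu Reeves")
matrix = graph.add_node(title="The Matrix")
graph.relate(keanu, "IN", matrix, role="Neo")  # illustrative property

# The relationship is stored once with a direction, yet can be walked either way.
cast_of_matrix = graph.neighbors(matrix, "IN", direction="in")
movies_of_keanu = graph.neighbors(keanu, "IN", direction="out")
```

The linear scan in `neighbors` keeps the sketch short; a real store would index relationships per node so that a hop does not depend on the total graph size.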
  • * Tree data structures
    * Networks
    * Maps
    * Vehicles on streets == packets through a network
    * Relational databases are graphs!
  • * Make each record a node
    * Make every foreign key a relationship
    * RDB indexes are usually stored in a tree structure
    * Trees are graphs
    * Why not use RDBs?
    * The trouble with RDBs is how they are stored in memory and queried
    ** Require a translation step from memory blocks to graph structure
    ** Relationships not first-class citizens
    ** Many problem domains map poorly to rows/tables
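The "record becomes a node, foreign key becomes a relationship" idea can be tried on a couple of in-memory tables. This is a rough sketch: the table layout, the `movie_id` column and the `rows_to_graph` helper are assumptions invented for the example.

```python
# Hypothetical rows from two tables; `movie_id` is a foreign key into `movies`.
movies = [
    {"id": 1, "title": "The Matrix"},
    {"id": 2, "title": "Mystic River"},
]
roles = [
    {"id": 10, "actor_name": "Keanu Reeves", "movie_id": 1},
    {"id": 11, "actor_name": "Laurence Fishburne", "movie_id": 1},
    {"id": 12, "actor_name": "Kevin Bacon", "movie_id": 2},
]


def rows_to_graph(tables, foreign_keys):
    """Turn rows into nodes and foreign keys into relationships.

    tables: {table_name: [row dict, ...]}, every row has an "id" column
    foreign_keys: [(table, column, target_table), ...]
    """
    nodes = {}
    for table, rows in tables.items():
        for row in rows:
            nodes[(table, row["id"])] = dict(row)  # each record is a node

    edges = []
    for table, column, target in foreign_keys:
        for row in tables[table]:
            if row[column] is not None:  # a NULL foreign key means no relationship
                edges.append(((table, row["id"]), column, (target, row[column])))
    return nodes, edges


nodes, edges = rows_to_graph(
    {"movies": movies, "roles": roles},
    [("roles", "movie_id", "movies")],
)
```

Each edge here is a `(start, type, end)` triple keyed by table name and primary key, which is enough to see the graph shape hiding in the rows.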
  • * Big Data
    ** Billions of nodes and relationships in a single instance
    * "Internet of Things" buzzword
    * Social networking - friends of friends of friends of friends
    * Assembly/Manufacturing - 1 widget contains 3 gadgets, each of which contains 2 gizmos
    * Map directions - starting at my house, find a route to the office that goes past the pub
    * Multi-tenancy - root node per tenant
    ** All queries start at root
    ** No overlap between graphs = no accidental data spillage
    * Fraud: track transactions back to origination
    * Pretty much anything that can be drawn on a whiteboard
  • * Example: retail system
    * Customer makes Order
    * Store sells Order
    * Order contains Items
    * Supplier supplied Items
    * Customer rates Items
    * Did this customer rank supplier X highly?
    * Which suppliers sell the highest rated items?
    * Does item A get rated higher when ordered with item B?
    * All can be answered with RDBs as well
    ** Not as elegant
    ** Not as performant
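One of the retail questions above, "Which suppliers sell the highest rated items?", can be answered by walking Supplier -SUPPLIED-> Item <-RATES- Customer. The sketch below does this over a plain list of relationships; the customers, suppliers, items and star values are invented sample data, and `average_rating_by_supplier` is a hypothetical helper.

```python
from collections import defaultdict

# (start, relationship type, end, properties); all sample data is invented.
relationships = [
    ("alice", "RATES", "lamp", {"stars": 5}),
    ("bob", "RATES", "lamp", {"stars": 4}),
    ("alice", "RATES", "desk", {"stars": 2}),
    ("acme", "SUPPLIED", "lamp", {}),
    ("globex", "SUPPLIED", "desk", {}),
]


def average_rating_by_supplier(relationships):
    """Average the stars on RATES relationships reachable from each supplier."""
    stars_by_item = defaultdict(list)
    for start, rel_type, end, props in relationships:
        if rel_type == "RATES":
            stars_by_item[end].append(props["stars"])

    stars_by_supplier = defaultdict(list)
    for start, rel_type, end, props in relationships:
        if rel_type == "SUPPLIED":
            stars_by_supplier[start].extend(stars_by_item[end])

    # Suppliers whose items have no ratings are left out of the result.
    return {
        supplier: sum(stars) / len(stars)
        for supplier, stars in stars_by_supplier.items()
        if stars
    }


ratings = average_rating_by_supplier(relationships)
best_supplier = max(ratings, key=ratings.get)
```

Note that the rating lives on the RATES relationship itself, which is the "relationships have properties" point from the earlier notes.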
  • * This is where the power of graph DBs comes from
    * Paths - find any relationship chain between A and B
    ** Kevin Bacon example, known start and end
    * Traversal - filter out paths that don't meet criteria
    ** Complex path finding, base next decision on existing path from start to current position
    ** Define path-finding (prune) and result filtering functions
    * Queries - here is what I want, find it however you can
    ** SPARQL, Gremlin, Cypher
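The traversal idea in these notes (a prune function that decides whether to keep expanding a path, and a filter function that decides which paths count as results) might look roughly like this. The `traverse` function, its parameter names and the street map are assumptions for illustration; the map borrows the "route to the office that goes past the pub" example from the use-case notes.

```python
def traverse(graph, start, prune, include):
    """Depth-first walk that returns every path accepted by `include`.

    graph:   {node: [neighbor, ...]}
    prune:   prune(path) -> True to stop expanding beyond this path
    include: include(path) -> True to keep this path as a result
    Both callbacks see the whole path from start to the current position.
    """
    results = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        if include(path):
            results.append(path)
        if prune(path):
            continue
        for neighbor in graph.get(path[-1], []):
            if neighbor not in path:  # never revisit a node on the current path
                stack.append(path + [neighbor])
    return results


# Invented street map.
streets = {
    "house": ["park", "pub"],
    "park": ["office"],
    "pub": ["office"],
    "office": [],
}

routes = traverse(
    streets,
    "house",
    prune=lambda path: path[-1] == "office" or len(path) > 4,
    include=lambda path: path[-1] == "office" and "pub" in path,
)
```

Because the callbacks receive the full path rather than just the current node, the next decision can depend on everything visited so far, which is what separates a traversal from a simple neighbor lookup.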
  • * Actors are nodes
    * Movies are nodes
    * Relationship: Actor is IN a movie
    * Pseudo-code shortened for brevity
    * Compare to degree selection join queries
  • * Cypher is "what to find"
    * Describe the "shape" of the thing you're looking for
    * Very whiteboard-friendly
    * Pros: easy to understand, query looks like domain model
    * Cons: not as powerful, not fully featured (YET)
    * Result set is an array of arrays
  • * RDBs are really good at data aggregation
    * Set math, duh
    * Graph DBs have to traverse the whole graph in order to do aggregation
    * Truly tabular means not a lot of relationships between the data types
  • * Billions of nodes and relationships in a single instance
    * Cluster replication
    * Transactions
    * Native bindings for Ruby, Python, and any language that can run on the JVM
    * Licensing

    1. Introduction to Graph Databases
         Josh Adell, 2011-08-06
    2. Who am I?
         • Software developer: PHP, JavaScript, SQL
         • Fan of using the right tool for the job
    3. The Problem
    4. The Solution?
         > -- Given "Keanu Reeves" find a connection to "Kevin Bacon"
         > SELECT ??? FROM cast WHERE ???
         +---------------------------------------------------------------------+
         | actor_name                 | movie_title                            |
         +============================+========================================+
         | Jennifer Connelly          | Higher Learning                        |
         +----------------------------+----------------------------------------+
         | Laurence Fishburne         | Mystic River                           |
         +----------------------------+----------------------------------------+
         | Laurence Fishburne         | Higher Learning                        |
         +----------------------------+----------------------------------------+
         | Kevin Bacon                | Mystic River                           |
         +----------------------------+----------------------------------------+
         | Keanu Reeves               | The Matrix                             |
         +----------------------------+----------------------------------------+
         | Laurence Fishburne         | The Matrix                             |
         +----------------------------+----------------------------------------+
    5. Find Every Actor at Each Degree
         > -- First degree
         > SELECT actor_name FROM cast WHERE movie_title IN
             (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')
         > -- Second degree
         > SELECT actor_name FROM cast WHERE movie_title IN
             (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
               (SELECT actor_name FROM cast WHERE movie_title IN
                 (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))
         > -- Third degree
         > SELECT actor_name FROM cast WHERE movie_title IN
             (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
               (SELECT actor_name FROM cast WHERE movie_title IN
                 (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
                   (SELECT actor_name FROM cast WHERE movie_title IN
                     (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))))
    6. The Truth
         • Relational databases aren't very good with relationships
    7. The Real Problem
         • Finding relationships across multiple degrees of separation
         •     ...and across multiple data types
         •     ...and where you don't even know there is a relationship
    8. The Real Solution
    9. Graph Examples
    10. Relational Databases are Graphs!
    11. Some Graph Use Cases
         • Social networking
         • Manufacturing
         • Mapping and Geolocation
         • Bioinformatics
         • Fraud detection
         • Multi-tenancy
    12. Modelling a Domain with Graphs
         • Graphs are "whiteboard-friendly"
         • Nouns become nodes
         • Verbs become relationships
         • Properties are adjectives and adverbs
    13. Graph Mining
         • Paths
         • Traversals
         • Ad-hoc Queries
    14. New Solution to the Bacon Problem
         $keanu = $actorIndex->find('name', 'Keanu Reeves');
         $kevin = $actorIndex->find('name', 'Kevin Bacon');
         $path = $keanu->findPathTo($kevin);
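The slide's `findPathTo` is pseudo-code, so the following is only one way such a call could work: a breadth-first search over a graph where actors and movies are both nodes, built from the cast rows on the "The Solution?" slide. The function names `build_graph` and `find_path` are invented here, and this is not presented as what any graph database does internally.

```python
from collections import deque

CAST = [
    ("Jennifer Connelly", "Higher Learning"),
    ("Laurence Fishburne", "Mystic River"),
    ("Laurence Fishburne", "Higher Learning"),
    ("Kevin Bacon", "Mystic River"),
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
]


def build_graph(cast):
    """Actors and movies are nodes; each cast row is an IN relationship."""
    graph = {}
    for actor, movie in cast:
        graph.setdefault(actor, set()).add(movie)
        graph.setdefault(movie, set()).add(actor)  # traversable in both directions
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns a shortest list of nodes, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:  # walk back to the start
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in graph[current]:
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


path = find_path(build_graph(CAST), "Keanu Reeves", "Kevin Bacon")
```

Unlike the per-degree SQL, a single search returns the connecting chain itself, alternating actors and movies, and the code stays the same however many degrees apart the two actors are.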
    15. Cypher
         • "What to find" vs. "How to find"
         // Find all the directors who have directed a movie scored by John Williams
         // that starred Kevin Bacon
         START actor=(actors, 'Kevin Bacon'), composer=(composers, 'John Williams')
         MATCH (actor)-[:IN]->(movie)<-[:DIRECTED]-(director),
               (movie)<-[:SCORED]-(composer)
         RETURN director
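For contrast with the declarative query, here is the "how to find" version of the same pattern written out by hand. It is a sketch over a list of `(start, type, end)` triples; the helper names are invented, the movie data was picked for illustration, and nothing here is meant to describe how a Cypher engine actually evaluates a MATCH clause.

```python
# (start, type, end) triples; sample data chosen for illustration.
RELATIONSHIPS = [
    ("Kevin Bacon", "IN", "JFK"),
    ("Oliver Stone", "DIRECTED", "JFK"),
    ("John Williams", "SCORED", "JFK"),
    ("Kevin Bacon", "IN", "Apollo 13"),
    ("Ron Howard", "DIRECTED", "Apollo 13"),
    ("James Horner", "SCORED", "Apollo 13"),
]


def ends(relationships, start, rel_type):
    """Nodes reached by following `rel_type` out of `start`."""
    return {e for s, t, e in relationships if s == start and t == rel_type}


def starts(relationships, rel_type, end):
    """Nodes that have a `rel_type` relationship pointing at `end`."""
    return {s for s, t, e in relationships if t == rel_type and e == end}


def directors_for(relationships, actor, composer):
    # (actor)-[:IN]->(movie) and (movie)<-[:SCORED]-(composer): same movie node
    movies = ends(relationships, actor, "IN") & ends(relationships, composer, "SCORED")
    # (movie)<-[:DIRECTED]-(director)
    directors = set()
    for movie in movies:
        directors |= starts(relationships, "DIRECTED", movie)
    return directors


directors = directors_for(RELATIONSHIPS, "Kevin Bacon", "John Williams")
```

The Cypher version states only the shape being looked for; the order of lookups and the intersection step that this code spells out are left to the query engine.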
    16. Are RDBs Useful At All?
         • Aggregation
         • Ordered data
         • Truly tabular data
         • Few or clearly defined relationships
    17.
         • Neo Technology
         • Embedded in Java applications
         • Standalone server via REST
         • Plugins: spatial, Lucene, RDF
         • Others:
             • TinkerPop
             • OrientDB
    18. Questions?
    19. Resources
         • Emil Eifrem (Neo Tech. CEO) webinar
             • Check out around the 54 minute mark
         • @josh_adell
         • Google+, Facebook, LinkedIn