Graph Databases
  • * graph db usage poll
  • * Six degrees game * Relational databases can't easily answer certain types of questions
  • * first pass using a relational database * cast table: actor_name, movie_title * hard to visualize the solution * In order to do this, you need to do multiple passes or joins
  • * Each degree adds a join * Increases complexity * Decreases performance * Stop when the actor you're looking for is in the list
  • * this problem highlights the ugly truth about RDBs * they weren't designed to handle these types of problems. * RDB relationships join data, but are not data in themselves
  • * Gather everything in the set that matches these criteria, then tell me if this thing is in the set * 1 set, no problem * 2nd set no problem * 3rd set not related to 1st * 4th not related to 2nd * 5th related to 1st and 4th * etc. * Relationships are only available between overlapping sets
  • * disjoint sets
  • * Graphs * Not X-Y * Computer Science definition of graphs
  • * graph theory
  • * Nodes can have arbitrary properties * Relationships can have arbitrary properties * Paths are found using traversal algorithms * Indexes help find starting points
  • * This is how graph dbs solve the problems that RDBs can't
  • * Tree data-structures * Networks * Maps * vehicles on streets == packets through network
  • * Make each record a node * Make every foreign key a relationship * RDB indexes are usually stored in a tree structure * Trees are graphs * Why not use RDBs? * The trouble with RDBs is how they are stored in memory and queried   * Require a translation step from memory blocks to graph structure * Relationships not first-class citizens * Many problem domains map poorly to rows/tables
  • * Actors are nodes * Movies are nodes * Relationship: Actor is IN a movie * pseudo-code shortened for brevity * Compare to degree selection join queries
  • * Social networking - friends of friends of friends of friends * Assembly/Manufacturing - 1 widget contains 3 gadgets each contain 2 gizmos * Map directions - starting at my house find a route to the office that goes past the pub * Multi-tenancy - root node per tenant * all queries start at root * No overlap between graphs = no accidental data spillage * Fraud: track transactions back to origination * Pretty much anything that can be drawn on a whiteboard
  • * Example: retail system * Customer makes Order * Store sells Order * Order contains Items * Supplier supplied Items * Customer rates Items * Did this customer rank supplier X highly? * Which suppliers sell the highest rated items? * Does item A get rated higher when ordered with Item B? * All can be answered with RDBs as well * Not as elegant * Not as performant
  • * Recreate Google+
  • * billions of nodes and relationships in a single instance * cluster replication * transactions * native bindings for Ruby, Python, and any language that can run on the JVM * Licensing * Neo4jPHP - Josh's REST client, not affiliated with Neo Technology
  • * Index can be saved separately * Or it is saved on `add` * Note that indexes don't have to be on real properties or values
  • * This is where the power of graph dbs comes from * Paths - find any relationship chain between A and B * Traversal - filter out paths that don't meet criteria * Queries - Here is what I want, find it however you can
  • * Paths deal with two known nodes * start and end point * This is the Kevin Bacon example, but with multiple datatypes  * Path can be treated as an array of nodes or relationships * findPathsTo() returns a PathFinder which can have further restrictions placed on it
  • * Written in JavaScript * plugins provide other languages: Groovy, Python * Anything that runs on JVM * Path object, check apidocs * inline edit/update/delete * explicit prune evaluator of maxDepth = 1 unless overridden * built in prune: none * built in return: all or all-but-start * Prune: should we continue down this path? Return: Should we return the entity at this position? * You can return things and still continue traversing * Pros: expressive, powerful, complex search behaviors, in-line edit/update * Cons: complex to write, complex to understand (query languages make this better)
  • * Not very familiar with it * Just mentioning it's out there
  • * Cypher is "what to find" * describe the "shape" of the thing you're looking for * Very white-board friendly * Pros: easy to understand, query looks like domain model * Cons: not as powerful, not fully featured (YET) * result set is an array of arrays 
  • * Three parts ** Where to start ** Shape to find   ** possibly qualifiers ** What to return
  • * If there could be more than one relationship type, could further constrain by ratings 
  • * Webadmin built into neo4j server
  • * RDBs are really good at data aggregation * Set math, duh * Have to traverse the whole graph in order to do aggregation * Truly tabular means not a lot of relationships between the data types

Graph Databases Presentation Transcript

  • 1. Graph Databases Josh Adell <josh.adell@gmail.com> 20110719
  • 2. Who am I?
      • Software developer: PHP, Javascript, SQL
      • http://www.dunnwell.com
      • Fan of using the right tool for the job
  • 3. The Problem
  • 4. The Solution?
    • > -- Given "Keanu Reeves" find a connection to "Kevin Bacon"
    • > SELECT ??? FROM cast WHERE ???
    • +---------------------------------------------------------------------+
    • | actor_name                 | movie_title                            |
    • +============================+========================================+
    • | Jennifer Connelly          | Higher Learning                        |
    • +----------------------------+----------------------------------------+
    • | Laurence Fishburne         | Mystic River                           |
    • +----------------------------+----------------------------------------+
    • | Laurence Fishburne         | Higher Learning                        |
    • +----------------------------+----------------------------------------+
    • | Kevin Bacon                | Mystic River                           |
    • +----------------------------+----------------------------------------+
    • | Keanu Reeves               | The Matrix                             |
    • +----------------------------+----------------------------------------+
    • | Laurence Fishburne         | The Matrix                             |
    • +----------------------------+----------------------------------------+
  • 5. Find Every Actor at Each Degree
    • > -- First degree
    • > SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')
    • > -- Second degree
    • > SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))
    • > -- Third degree
    • > SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))))
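The per-degree queries above can be driven from a loop instead of being nested by hand. The following is only a sketch: it assumes an in-memory SQLite database loaded with the sample `cast` rows from the earlier slide, and the function name `degrees_of_separation` and its `max_degree` parameter are made up for illustration. Each pass runs one more "same movie" lookup and stops as soon as the target actor shows up, which is the relational approach the slide describes.

```python
import sqlite3

# Sample rows from the cast table shown on the earlier slide.
CAST_ROWS = [
    ("Jennifer Connelly", "Higher Learning"),
    ("Laurence Fishburne", "Mystic River"),
    ("Laurence Fishburne", "Higher Learning"),
    ("Kevin Bacon", "Mystic River"),
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
]


def degrees_of_separation(conn, start, target, max_degree=6):
    """Return how many passes are needed to reach target, or None."""
    known = {start}
    for degree in range(1, max_degree + 1):
        marks = ",".join("?" * len(known))
        # "cast" is quoted because CAST is also an SQL keyword.
        query = (
            'SELECT DISTINCT actor_name FROM "cast" WHERE movie_title IN '
            f'(SELECT movie_title FROM "cast" WHERE actor_name IN ({marks}))'
        )
        reached = {name for (name,) in conn.execute(query, sorted(known))}
        if target in reached:
            return degree
        if reached <= known:
            # No new actors were found, so the target is unreachable.
            return None
        known |= reached
    return None


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "cast" (actor_name TEXT, movie_title TEXT)')
conn.executemany('INSERT INTO "cast" VALUES (?, ?)', CAST_ROWS)

bacon_number = degrees_of_separation(conn, "Kevin Bacon", "Keanu Reeves")
unreachable = degrees_of_separation(conn, "Kevin Bacon", "Nobody At All")
```

Every extra degree costs another full pass over the table, and the loop only reports the distance, not which movies connect the two actors.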
  • 6. The Truth
    • Relational databases aren't very good with relationships
    Data RDBMs
  • 7. RDBs Use Set Math
  • 8. The Real Problem
    • Finding relationships across multiple degrees of separation
    •     ...and across multiple data types
    •     ...and where you don't even know there is a relationship
  • 9. The Real Solution
  • 10. Computer Science Definition
    • A graph is an ordered pair G = (V, E) where V is a set of vertices and E is a set of edges, which are pairs of vertices.
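As a small illustration of this definition, the sketch below writes a graph literally as a pair of Python sets. The vertices are taken from the sample cast data; the helper name `neighbours` is an assumption made for this example and is not part of any library.

```python
# V: the vertices. Here both actors and movies are vertices.
V = {
    "Keanu Reeves",
    "Laurence Fishburne",
    "Kevin Bacon",
    "The Matrix",
    "Mystic River",
}

# E: the edges, each one a pair of vertices (actor, movie).
E = {
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
    ("Laurence Fishburne", "Mystic River"),
    ("Kevin Bacon", "Mystic River"),
}

# The graph itself is just the ordered pair.
G = (V, E)


def neighbours(graph, vertex):
    """Return every vertex that shares an edge with the given vertex."""
    _, edges = graph
    found = set()
    for a, b in edges:
        if a == vertex:
            found.add(b)
        elif b == vertex:
            found.add(a)
    return found


fishburne_links = neighbours(G, "Laurence Fishburne")
```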
  • 11. Some Graph DB Vocabulary
      • Node : vertex
      • Relationship : edge
      • Property : meta-datum attached to a node or relationship
      • Path : an ordered list of nodes and relationships
      • Index : node or relationship lookup table
  • 12. Relationships are First-Class Citizens
      • Have a type
      • Have properties
      • Have a direction
        • Domain semantics
        • Traversable in any direction
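One way to picture "first-class" relationships is as objects in their own right rather than as join conditions. The sketch below is a plain in-memory model, not the Neo4j API: the class names, the `relate` and `related` helpers, and the `date` property are all assumptions made for illustration. Each relationship carries a type, properties, and a direction, yet can be followed from either end.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    properties: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)


@dataclass(eq=False)
class Relationship:
    start: Node   # direction runs from start...
    end: Node     # ...to end
    type: str
    properties: dict = field(default_factory=dict)


def relate(start, end, rel_type, **properties):
    """Create a relationship and register it on both of its nodes."""
    rel = Relationship(start, end, rel_type, properties)
    start.relationships.append(rel)
    end.relationships.append(rel)
    return rel


def related(node, rel_type, direction="both"):
    """Follow relationships of one type outward, inward, or both ways."""
    found = []
    for rel in node.relationships:
        if rel.type != rel_type:
            continue
        if rel.start is node and direction in ("out", "both"):
            found.append(rel.end)
        elif rel.end is node and direction in ("in", "both"):
            found.append(rel.start)
    return found


customer = Node({"name": "Josh"})
order = Node()
bought = relate(customer, order, "BOUGHT", date="2011-07-19")
```

The direction records the domain meaning (the customer bought the order, not the reverse), while `related(order, "BOUGHT", "in")` still walks it backwards.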
  • 13. Graph Examples
  • 14. Relational Databases are Graphs!
  • 15. New Solution to the Bacon Problem
    • $keanu = $actorIndex->find('name', 'Keanu Reeves');
    • $kevin = $actorIndex->find('name', 'Kevin Bacon');
    • $path = $keanu->findPathTo($kevin);
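What a call like `findPathTo` might do underneath can be sketched with a breadth-first search. This is an illustration only, not how Neo4j or Neo4jPHP implements path finding: the names `build_adjacency` and `find_path` are invented here, and the data is the sample cast table with actors and movies both treated as nodes.

```python
from collections import deque

# (actor, movie) pairs: the "Actor is IN a movie" relationship.
IN_MOVIE = [
    ("Jennifer Connelly", "Higher Learning"),
    ("Laurence Fishburne", "Mystic River"),
    ("Laurence Fishburne", "Higher Learning"),
    ("Kevin Bacon", "Mystic River"),
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
]


def build_adjacency(pairs):
    """Map every node to the set of nodes it is directly related to."""
    adjacency = {}
    for actor, movie in pairs:
        adjacency.setdefault(actor, set()).add(movie)
        adjacency.setdefault(movie, set()).add(actor)
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in sorted(adjacency[current]):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None


graph = build_adjacency(IN_MOVIE)
path = find_path(graph, "Keanu Reeves", "Kevin Bacon")
```

Unlike the nested SQL, the result is the connection itself (actor, movie, actor, movie, actor), and the search only touches nodes reachable from the start.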
  • 16. Some Graph Use Cases
      • Social networking
      • Manufacturing
      • Map directions
      • Fraud detection
      • Multi-tenancy
  • 17. Modelling a Domain with Graphs
      • Graphs are "whiteboard-friendly"
      • Nouns become nodes
      • Verbs become relationships
      • Properties are adjectives and adverbs
  • 18. Audience Participation!
  • 19.
      • Neo Technology
      • http://neo4j.org
      • Embedded in Java applications
      • Standalone server via REST
      • Plugins: spatial, lucene, rdf
      • http://github.com/jadell/Neo4jPHP
  • 20. Using the REST client
    • $client = new Client(new Transport());
    • $customer = new Node($client);
    • $customer->setProperty('name', 'Josh')->save();
    • $store = new Node($client);
    • $store->setProperty('name', 'Home Despot')
    •       ->setProperty('location', 'Durham, NC')->save();
    • $order = new Node($client);
    • $order->save();
    • $item = new Node($client);
    • $item->setProperty('item_number', 'Q32-ESM')->save();
    • $order->relateTo($item, 'CONTAINS')->save();
    • $customer->relateTo($order, 'BOUGHT')->save();
    • $store->relateTo($order, 'SOLD')->save();
    • $customerIndex = new Index($client, Index::TypeNode, 'customers');
    • $customerIndex->add($customer, 'name', $customer->getProperty('name'));
    • $customerIndex->add($customer, 'rating', 'A++');
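The last two lines of the slide add the same node to an index under two keys, one of which (`rating`) is not a property of the node at all. A rough mental model of such an index is a lookup table from (key, value) pairs to nodes. The class below is a hypothetical in-memory stand-in written for this note, not the Neo4jPHP `Index` class, and its method names are assumptions.

```python
from collections import defaultdict


class Index:
    """A lookup table from (key, value) pairs to nodes."""

    def __init__(self, name):
        self.name = name
        self._entries = defaultdict(list)

    def add(self, node, key, value):
        bucket = self._entries[(key, value)]
        # Compare by identity so the same node is never stored twice.
        if not any(existing is node for existing in bucket):
            bucket.append(node)

    def find(self, key, value):
        return list(self._entries.get((key, value), []))

    def find_one(self, key, value):
        matches = self.find(key, value)
        return matches[0] if matches else None


customer = {"name": "Josh"}
customer_index = Index("customers")
customer_index.add(customer, "name", customer["name"])
# 'rating' is not a property of the node; it exists only in the index.
customer_index.add(customer, "rating", "A++")

by_name = customer_index.find_one("name", "Josh")
by_rating = customer_index.find_one("rating", "A++")
```

Both lookups return the same node object, which then serves as a starting point for path finding or traversal.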
  • 21. Graph Mining
      • Paths
      • Traversals
      • Ad-hoc Queries
  • 22. Path Finding
      • Find any connection from node A to node B
      • Limit by relationship types and/or direction
      • Path finding algorithms: all, simple, shortest, Dijkstra
    • $customer = $customerIndex->findOne('name', 'Josh');
    • $item = $itemIndex->findOne('item_number', 'Q32-ESM');
    • $path = $item->findPathsTo($customer)
    •     ->setMaxDepth(2)
    •     ->getSinglePath();
    • foreach ($path as $node) {
    •     echo $node->getId() . "\n";
    • }
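The two restrictions listed on this slide, a maximum depth and a set of allowed relationship types, can be sketched as a depth-limited search. This is an illustration under stated assumptions rather than the library's algorithm: the function `find_paths`, the edge tuples, and the node names (`order-1` in particular) are invented for the example, direction is ignored, and a path is simplified to a list of node names.

```python
# (start, relationship type, end), mirroring the retail example.
EDGES = [
    ("Josh", "BOUGHT", "order-1"),
    ("Home Despot", "SOLD", "order-1"),
    ("order-1", "CONTAINS", "Q32-ESM"),
]


def find_paths(edges, start, goal, max_depth, types=None):
    """Yield every simple path from start to goal within max_depth hops.

    Relationships are followed in either direction. If types is given,
    only relationships of those types are followed.
    """

    def walk(node, path, depth):
        if node == goal:
            yield list(path)
            return
        if depth == max_depth:
            return
        for a, rel_type, b in edges:
            if types is not None and rel_type not in types:
                continue
            if a == node:
                nxt = b
            elif b == node:
                nxt = a
            else:
                continue
            if nxt in path:
                continue  # keep the path simple: no repeated nodes
            path.append(nxt)
            yield from walk(nxt, path, depth + 1)
            path.pop()

    yield from walk(start, [start], 0)


within_two = list(find_paths(EDGES, "Q32-ESM", "Josh", max_depth=2))
within_one = list(find_paths(EDGES, "Q32-ESM", "Josh", max_depth=1))
contains_only = list(
    find_paths(EDGES, "Q32-ESM", "Josh", max_depth=2, types={"CONTAINS"})
)
```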
  • 23. Traversal
      • Complex/Custom path finding
      • Base next decision on previous path
    • $traversal = new Traversal($client);
    • $traversal
    •     ->setOrder(Traversal::OrderDepthFirst)
    •     ->setUniqueness(Traversal::UniquenessNodeGlobal)
    •     ->setPruneEvaluator('javascript', '(function traverse(pos) {
    •         if (pos.length() == 1 && pos.lastRelationship().getType() == "CONTAINS") {
    •             return false;
    •         } else if (pos.length() == 2 && pos.lastRelationship().getType() == "BOUGHT") {
    •             return false;
    •         }
    •         return true;
    •     })(position)')
    •     ->setReturnFilter('javascript',
    •         "return position.endNode().getProperty('type') == 'Customer';");
    • $customers = $traversal->getResults($item, Traversal::ReturnTypeNode);
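The split between a prune evaluator and a return filter may be easier to see in a stripped-down form. The sketch below is loosely modelled on the slide, not on Neo4j's traversal framework: `traverse`, the adjacency layout, and the `KINDS` lookup are assumptions, and here `prune` returning `True` is taken to mean "stop expanding from this position". The return filter is checked independently, so a node can be returned even where the traversal is pruned.

```python
# node -> list of (relationship type, neighbour); direction is ignored.
ADJACENCY = {
    "Q32-ESM": [("CONTAINS", "order-1")],
    "order-1": [
        ("CONTAINS", "Q32-ESM"),
        ("BOUGHT", "Josh"),
        ("SOLD", "Home Despot"),
    ],
    "Josh": [("BOUGHT", "order-1")],
    "Home Despot": [("SOLD", "order-1")],
}
KINDS = {
    "Q32-ESM": "Item",
    "order-1": "Order",
    "Josh": "Customer",
    "Home Despot": "Store",
}


def traverse(adjacency, start, prune, return_filter):
    """Depth-first traversal that visits each node at most once."""
    visited = set()
    results = []
    stack = [(start, [])]  # (node, relationship types followed so far)
    while stack:
        node, path = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if return_filter(node, path):
            results.append(node)
        if prune(node, path):
            continue  # returned or not, do not go deeper from here
        for rel_type, neighbour in reversed(adjacency.get(node, [])):
            if neighbour not in visited:
                stack.append((neighbour, path + [rel_type]))
    return results


def prune(node, path):
    # Only follow CONTAINS and then BOUGHT, and stop after two hops.
    expected = ["CONTAINS", "BOUGHT"]
    return path != expected[: len(path)] or len(path) >= 2


def is_customer(node, path):
    return KINDS[node] == "Customer"


customers = traverse(ADJACENCY, "Q32-ESM", prune, is_customer)
```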
  • 24.
      • Uses mathematical notation approach
      • Complex traversal behaviors, including backtracking
      • https://github.com/tinkerpop/gremlin/wiki
    • m = [:]
    • g.v(1).out('likes').in('likes').out('likes').groupCount(m)
    • m.sort{a,b -> a.value <=> b.value}
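For readers who do not know Gremlin, the three-step walk above (out along `likes`, back in along `likes`, out along `likes` again, counting where it ends) can be spelled out in plain Python. This is only a reading aid with made-up data: the `LIKES` table, the thing names `a` to `d`, and the function `group_count_likes` are assumptions, and vertex `1` stands in for `g.v(1)`.

```python
from collections import Counter

# Hypothetical data: vertex id -> the things that vertex likes.
LIKES = {
    1: ["a", "b"],
    2: ["a", "c"],
    3: ["b", "c", "d"],
}


def group_count_likes(likes, person):
    """Count how often each thing is reached by out, in, out over 'likes'."""
    liked_by = {}
    for fan, things in likes.items():
        for thing in things:
            liked_by.setdefault(thing, []).append(fan)

    counts = Counter()
    for thing in likes[person]:              # out('likes')
        for fan in liked_by[thing]:          # in('likes')
            for candidate in likes[fan]:     # out('likes')
                counts[candidate] += 1       # groupCount(m)
    return counts


m = group_count_likes(LIKES, 1)
# Same ordering as the sort closure: ascending by count.
ranked = sorted(m.items(), key=lambda entry: entry[1])
```

As in the Gremlin version, the walk passes back through the starting vertex, so things that vertex already likes are counted too.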
  • 25. Cypher
      • "What to find" vs. "How to find"
    • $query = 'START item=(1) MATCH (item)<-[:CONTAINS]-(order)<-[:BOUGHT]-(customer) RETURN customer';
    • $cypher = new CypherQuery($client, $query);
    • $customers = $cypher->getResultSet();
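To make the "shape" idea concrete, the pattern `(item)<-[:CONTAINS]-(order)<-[:BOUGHT]-(customer)` can be read as a chain of steps, each naming a relationship type and a direction. The matcher below is a toy written for this note, not how Cypher is evaluated: `match_chain`, the edge tuples, and the second customer `Ann` with `order-2` are all hypothetical.

```python
# (start, relationship type, end); direction runs from start to end.
EDGES = [
    ("Josh", "BOUGHT", "order-1"),
    ("order-1", "CONTAINS", "Q32-ESM"),
    ("Ann", "BOUGHT", "order-2"),
    ("order-2", "CONTAINS", "Q32-ESM"),
    ("Home Despot", "SOLD", "order-1"),
]


def match_chain(edges, start, steps):
    """Follow a chain of (relationship type, direction) steps from start.

    direction is "in" for <-[:TYPE]- and "out" for -[:TYPE]->.
    Returns the nodes found at the end of the chain.
    """
    frontier = [start]
    for rel_type, direction in steps:
        next_frontier = []
        for node in frontier:
            for a, edge_type, b in edges:
                if edge_type != rel_type:
                    continue
                if direction == "in" and b == node:
                    next_frontier.append(a)
                elif direction == "out" and a == node:
                    next_frontier.append(b)
        frontier = next_frontier
    return frontier


# (item)<-[:CONTAINS]-(order)<-[:BOUGHT]-(customer)
customers = match_chain(
    EDGES, "Q32-ESM", [("CONTAINS", "in"), ("BOUGHT", "in")]
)
```

The caller states only the shape; how the edges are scanned is left to the matcher, which is the division of labour the slide attributes to Cypher.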
  • 26. Cypher Syntax
    • START item = (1)                        START item = (1,2,3)
    • START item = (items, 'name:Q32*')       START item = (1), customer = (2,3)
    • MATCH (item)<--(order)                  MATCH (order)-->(item)
    • MATCH (order)-[r]->(item)                                MATCH ()--(item)
    • MATCH
    •      (supplier)-[:SUPPLIES]->(item)<-[:CONTAINS]-(order),
    •     (customer)-[:RATED]->(item)
    • WHERE customer.name = 'Josh' and supplier.coupon = 'freewidget'
    • RETURN item, order                      RETURN customer, item, r.rating
    • RETURN r~TYPE                                                        RETURN COUNT(*)
    • ORDER BY customer.name DESC             RETURN AVG(r.rating)
    • LIMIT 3 SKIP 2
  • 27. Cypher - All Together Now
    • // Find the top 10 `widget` ratings by customers who bought AND rated
    • // `widgets`, and the supplier
    • START item = (items, 'name:widget')
    • MATCH (item)<--(order)<--(customer)-[r:RATED]->(item)<--(supplier)
    • RETURN customer, r.rating, supplier ORDER BY r.rating DESC LIMIT 10
  • 28. Tools
      • Neoclipse
      • Webadmin
  • 29. Are RDBs Useful At All?
      • Aggregation
      • Ordered data
      • Truly tabular data
      • Few or clearly defined relationships
  • 30. Questions?
  • 31. Resources
      • http://neo4j.org
      • http://docs.neo4j.org
      • http://www.youtube.com/watch?v=UodTzseLh04
        • Emil Eifrem (Neo Tech. CEO) webinar
        • Check out around the 54 minute mark
      • http://github.com/jadell/Neo4jPHP
      • http://joshadell.com
      • [email_address]
      • @josh_adell
      • Google+, Facebook, LinkedIn