# Graph Databases

• * graph db usage poll
• * Six degrees game * Relational databases can't easily answer certain types of questions
• * first pass using a relational database * cast table: actor_name, movie_title * hard to visualize the solution * In order to do this, you need to do multiple passes or joins
• * Each degree adds a join * Increases complexity * Decreases performance * Stop when the actor you're looking for is in the list
• * this problem highlights the ugly truth about RDBs * they weren't designed to handle these types of problems. * RDB relationships join data, but are not data in themselves
• * Gather everything in the set that matches these criteria, then tell me if this thing is in the set * 1 set, no problem * 2nd set no problem * 3rd set not related to 1st * 4th not related to 2nd * 5th related to 1st and 4th * etc. * Relationships are only available between overlapping sets
• * disjoint sets
• * Graphs * Not X-Y * Computer Science definition of graphs
• * graph theory
• * Nodes can have arbitrary properties * Relationships can have arbitrary properties * Paths are found using traversal algorithms * Indexes help find starting points
• * This is how graph dbs solve the problems that RDBs can't
• * Tree data-structures * Networks * Maps * vehicles on streets == packets through network
• * Make each record a node * Make every foreign key a relationship * RDB indexes are usually stored in a tree structure * Trees are graphs * Why not use RDBs? * The trouble with RDBs is how they are stored in memory and queried   * Require a translation step from memory blocks to graph structure * Relationships not first-class citizens * Many problem domains map poorly to rows/tables
• * Actors are nodes * Movies are nodes * Relationship: Actor is IN a movie * pseudo-code shortened for brevity * Compare to degree selection join queries
• * Social networking - friends of friends of friends of friends * Assembly/Manufacturing - 1 widget contains 3 gadgets each contain 2 gizmos * Map directions - starting at my house find a route to the office that goes past the pub * Multi-tenancy - root node per tenant * all queries start at root * No overlap between graphs = no accidental data spillage * Fraud: track transactions back to origination * Pretty much anything that can be drawn on a whiteboard
• * Example: retail system * Customer makes Order * Store sells Order * Order contains Items * Supplier supplied Items * Customer rates Items * Did this customer rank supplier X highly? * Which suppliers sell the highest rated items? * Does item A get rated higher when ordered with Item B? * All can be answered with RDBs as well * Not as elegant * Not as performant
• * billions of nodes and relationships in a single instance * cluster replication * transactions * native bindings for Ruby, Python, and any language that can run on the JVM * Licensing * Neo4jPHP - Josh's REST client, not affiliated with Neo Technologies
• * Index can be saved separately * Or it is saved on `add` * Note that indexes don't have to be on real properties or values
• * This is where the power of graph dbs comes from * Paths - find any relationship chain between A and B * Traversal - filter out paths that don't meet criteria * Queries - Here is what I want, find it however you can
• * Paths deal with two known nodes * start and end point * This is the Kevin Bacon example, but with multiple datatypes  * Path can be treated as an array of nodes or relationships * findPathsTo() returns a PathFinder which can have further restrictions placed on it
• * Written in JavaScript * plugins provide other languages: Groovy, Python * Anything that runs on the JVM * Path object, check apidocs * inline edit/update/delete * explicit prune evaluator of maxDepth = 1 unless overridden * built in prune: none * built in return: all or all-but-start * Prune: should we continue down this path? Return: should we return the entity at this position? * You can return things and still continue traversing * Pros: expressive, powerful, complex search behaviors, in-line edit/update * Cons: complex to write, complex to understand (query languages make this better)
• * Not very familiar with it * Just mentioning it's out there
• * Cypher is "what to find" * describe the "shape" of the thing you're looking for * Very white-board friendly * Pros: easy to understand, query looks like domain model * Cons: not as powerful, not fully featured (YET) * result set is an array of arrays
• * Three parts ** Where to start ** Shape to find   ** possibly qualifiers ** What to return
• * If there could be more than one relationship type, could further constrain by ratings
• * Webadmin built into neo4j server
• * RDBs are really good at data aggregation * Set math, duh * Have to traverse the whole graph in order to do aggregation * Truly tabular means not a lot of relationships between the data types
### Graph Databases

2. Who am I?
   * Software developer: PHP, JavaScript, SQL
   * http://www.dunnwell.com
   * Fan of using the right tool for the job
3. The Problem
4. The Solution?

       -- Given "Keanu Reeves" find a connection to "Kevin Bacon"
       SELECT ??? FROM cast WHERE ???

       +---------------------------------------------------------------------+
       | actor_name                 | movie_title                            |
       +============================+========================================+
       | Jennifer Connelly          | Higher Learning                        |
       +----------------------------+----------------------------------------+
       | Laurence Fishburne         | Mystic River                           |
       +----------------------------+----------------------------------------+
       | Laurence Fishburne         | Higher Learning                        |
       +----------------------------+----------------------------------------+
       | Kevin Bacon                | Mystic River                           |
       +----------------------------+----------------------------------------+
       | Keanu Reeves               | The Matrix                             |
       +----------------------------+----------------------------------------+
       | Laurence Fishburne         | The Matrix                             |
       +----------------------------+----------------------------------------+

5. Find Every Actor at Each Degree

       -- First degree
       SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')
       -- Second degree
       SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
           (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))
       -- Third degree
       SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
           (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name IN
               (SELECT actor_name FROM cast WHERE movie_title IN (SELECT DISTINCT movie_title FROM cast WHERE actor_name='Kevin Bacon')))))

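
The same degree-by-degree expansion can be sketched outside SQL. This is only an illustration in Python, not code from the talk: the rows follow the sample `cast` table, and `actors_within` is a hypothetical helper name. Each pass of the loop plays the role of one more nested subquery pair.

```python
# Rows taken from the sample cast table: (actor_name, movie_title).
CAST = [
    ("Jennifer Connelly", "Higher Learning"),
    ("Laurence Fishburne", "Mystic River"),
    ("Laurence Fishburne", "Higher Learning"),
    ("Kevin Bacon", "Mystic River"),
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
]


def actors_within(cast, start, degrees):
    """Return every actor reachable from `start` in at most `degrees` hops.

    Hypothetical helper: one loop pass equals one extra level of nesting
    in the SQL version (actors -> their movies -> actors in those movies).
    """
    reached = {start}
    for _ in range(degrees):
        movies = {movie for actor, movie in cast if actor in reached}
        reached = {actor for actor, movie in cast if movie in movies}
    return reached


first_degree = actors_within(CAST, "Kevin Bacon", 1)
second_degree = actors_within(CAST, "Kevin Bacon", 2)
```

Like the SQL, every extra degree rescans the whole table, which is the cost the talk is pointing at.
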
6. The Truth
   * Relational databases aren't very good with relationships
   * [Figure labels: Data, RDBMs]
7. RDBs Use Set Math
8. The Real Problem
   * Finding relationships across multiple degrees of separation
   * ...and across multiple data types
   * ...and where you don't even know there is a relationship
9. The Real Solution
10. Computer Science Definition
    * A graph is an ordered pair G = (V, E) where V is a set of vertices and E is a set of edges, which are pairs of vertices.
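
As a rough illustration of that definition, a graph can be written down literally as a pair of sets. The vertex names and the `neighbors` helper below are assumptions made for the example, and edges are treated as unordered pairs.

```python
# V: a set of vertices. E: a set of edges, each an unordered pair of vertices.
V = {"Keanu Reeves", "Laurence Fishburne", "Kevin Bacon"}
E = {
    frozenset({"Keanu Reeves", "Laurence Fishburne"}),
    frozenset({"Laurence Fishburne", "Kevin Bacon"}),
}
G = (V, E)


def neighbors(graph, vertex):
    """Return the vertices that share an edge with `vertex` (hypothetical helper)."""
    _vertices, edges = graph
    return {
        other
        for edge in edges
        if vertex in edge
        for other in edge
        if other != vertex
    }


fishburne_neighbors = neighbors(G, "Laurence Fishburne")
```
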
11. Some Graph DB Vocabulary
    * Node: vertex
    * Relationship: edge
    * Property: meta-datum attached to a node or relationship
    * Path: an ordered list of nodes and relationships
    * Index: node or relationship lookup table
12. Relationships are First-Class Citizens
    * Have a type
    * Have properties
    * Have a direction
      * Domain semantics
      * Traversable in any direction
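
One way to picture a first-class relationship is as its own record with a type, properties, and a direction. The sketch below is an assumption-laden Python model, not Neo4j's storage format; the class names, the `other` method, and the rating value are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node  # direction carries domain meaning: start -> end
    end: Node
    type: str
    properties: dict = field(default_factory=dict)

    def other(self, node):
        """Walk the relationship from either end, ignoring its direction."""
        if node is self.start:
            return self.end
        if node is self.end:
            return self.start
        raise ValueError("node is not attached to this relationship")


customer = Node({"name": "Josh"})
item = Node({"item_number": "Q32-ESM"})
# The rating value is made up for the example.
rated = Relationship(customer, item, "RATED", {"rating": 5})
```

The relationship stores data of its own (`rating`) and can be followed from either node, which a foreign-key column cannot do by itself.
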
13. Graph Examples
14. Relational Databases are Graphs!
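
The "record becomes a node, foreign key becomes a relationship" idea can be sketched as a small conversion. The tables, column names, and the `rows_to_graph` helper are hypothetical; this is only meant to show the mapping.

```python
# Hypothetical rows; table and column names are invented for illustration.
customers = [{"id": 1, "name": "Josh"}]
orders = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 1}]


def rows_to_graph(customers, orders):
    """Turn each row into a node and each foreign key into a relationship."""
    nodes = {}
    relationships = []
    for row in customers:
        nodes[("customer", row["id"])] = {"name": row["name"]}
    for row in orders:
        nodes[("order", row["id"])] = {}
        # The customer_id column becomes an explicit (start, type, end) triple.
        relationships.append(
            (("customer", row["customer_id"]), "BOUGHT", ("order", row["id"]))
        )
    return nodes, relationships


nodes, relationships = rows_to_graph(customers, orders)
```
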
15. New Solution to the Bacon Problem

        $keanu = $actorIndex->findOne('name', 'Keanu Reeves');
        $kevin = $actorIndex->findOne('name', 'Kevin Bacon');
        $path = $keanu->findPathsTo($kevin)->getSinglePath();

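
What a path finder does behind a call like this can be approximated with a breadth-first search. The Python below is a sketch under assumptions, not Neo4j's implementation: actors and movies are both treated as nodes, names are assumed unique across the two, and `find_path` is a hypothetical function.

```python
from collections import deque

# Rows taken from the sample cast table: (actor_name, movie_title).
CAST = [
    ("Jennifer Connelly", "Higher Learning"),
    ("Laurence Fishburne", "Mystic River"),
    ("Laurence Fishburne", "Higher Learning"),
    ("Kevin Bacon", "Mystic River"),
    ("Keanu Reeves", "The Matrix"),
    ("Laurence Fishburne", "The Matrix"),
]


def find_path(cast, start, goal):
    """Breadth-first search; returns actor, movie, actor, ... or None."""
    neighbors = {}
    for actor, movie in cast:
        neighbors.setdefault(actor, set()).add(movie)
        neighbors.setdefault(movie, set()).add(actor)

    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in sorted(neighbors.get(current, ())):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


path = find_path(CAST, "Keanu Reeves", "Kevin Bacon")
```

The search only touches nodes connected to the starting actor, instead of rescanning the whole table once per degree.
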
16. Some Graph Use Cases
    * Social networking
    * Manufacturing
    * Map directions
    * Fraud detection
    * Multi-tenancy
17. Modelling a Domain with Graphs
    * Graphs are "whiteboard-friendly"
    * Nouns become nodes
    * Verbs become relationships
    * Properties are adjectives and adverbs
18. Audience Participation!
19.
    * Neo Technologies
    * http://neo4j.org
    * Embedded in Java applications
    * Standalone server via REST
    * Plugins: spatial, lucene, rdf
    * http://github.com/jadell/Neo4jPHP
20. Using the REST client

        // Assumes the Everyman\Neo4j namespace used by Neo4jPHP
        use Everyman\Neo4j\Client;
        use Everyman\Neo4j\Transport;
        use Everyman\Neo4j\Node;
        use Everyman\Neo4j\Index;

        $client = new Client(new Transport());

        // Create the nodes
        $customer = new Node($client);
        $customer->setProperty('name', 'Josh')->save();
        $store = new Node($client);
        $store->setProperty('name', 'Home Despot')
              ->setProperty('location', 'Durham, NC')->save();
        $order = new Node($client);
        $order->save();
        $item = new Node($client);
        $item->setProperty('item_number', 'Q32-ESM')->save();

        // Relate them
        $order->relateTo($item, 'CONTAINS')->save();
        $customer->relateTo($order, 'BOUGHT')->save();
        $store->relateTo($order, 'SOLD')->save();

        // Index the customer; index keys need not be real properties
        $customerIndex = new Index($client, Index::TypeNode, 'customers');
        $customerIndex->add($customer, 'name', $customer->getProperty('name'));
        $customerIndex->add($customer, 'rating', 'A++');

21. Graph Mining
    * Paths
    * Traversals
    * Ad-hoc Queries
22. Path Finding
    * Find any connection from node A to node B
    * Limit by relationship types and/or direction
    * Path finding algorithms: all, simple, shortest, Dijkstra

        $customer = $customerIndex->findOne('name', 'Josh');
        $item = $itemIndex->findOne('item_number', 'Q32-ESM');
        $path = $item->findPathsTo($customer)
                     ->setMaxDepth(2)
                     ->getSinglePath();
        foreach ($path as $node) {
            echo $node->getId() . "\n";
        }

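
Of the algorithms listed, Dijkstra is the one that takes relationship weights into account. A minimal Python sketch follows, assuming a plain dict-of-dicts graph; the `dijkstra` function, the place names, and the travel times are invented for the example and are not part of Neo4j or Neo4jPHP.

```python
import heapq


def dijkstra(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or None if unreachable."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Hypothetical weighted graph: travel minutes between places.
ROADS = {
    "house": {"pub": 5, "office": 20},
    "pub": {"office": 10},
    "office": {},
}

route = dijkstra(ROADS, "house", "office")
```

Here the cheapest route happens to go past the pub (15 minutes) rather than straight to the office (20 minutes).
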
23. Traversal
    * Complex/Custom path finding
    * Base next decision on previous path

        $traversal = new Traversal($client);
        $traversal
            ->setOrder(Traversal::OrderDepthFirst)
            ->setUniqueness(Traversal::UniquenessNodeGlobal)
            // Keep going only along CONTAINS (depth 1) then BOUGHT (depth 2); prune everything else
            ->setPruneEvaluator('javascript', '(function traverse(pos) {
                if (pos.length() == 1 && pos.lastRelationship().getType().name() == "CONTAINS") {
                    return false;
                } else if (pos.length() == 2 && pos.lastRelationship().getType().name() == "BOUGHT") {
                    return false;
                }
                return true;
            })(position)')
            // Return only nodes whose `type` property is Customer
            ->setReturnFilter('javascript',
                'position.endNode().getProperty("type") == "Customer";');
        $customers = $traversal->getResults($item, Traversal::ReturnTypeNode);

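
The split between "should we continue down this path?" (prune) and "should we return this node?" (return filter) can be sketched in plain Python. Everything here is an assumption for illustration: the `traverse` function, the node names, and the sample relationships are hypothetical, and the prune check is skipped for the start node as a design choice of this sketch.

```python
def traverse(start, relationships, should_prune, should_return):
    """Depth-first traversal over (start, type, end) triples, in either direction.

    `path` is the list of relationship types walked to reach a node.
    """
    results = []
    visited = {start}  # global node uniqueness: each node is visited once
    stack = [(start, [])]
    while stack:
        node, path = stack.pop()
        if should_return(node, path):
            results.append(node)
        if path and should_prune(node, path):
            continue  # returned or not, do not expand past this node
        for a, rel_type, b in relationships:
            if node in (a, b):
                other = b if node == a else a
                if other not in visited:
                    visited.add(other)
                    stack.append((other, path + [rel_type]))
    return results


# Hypothetical data, loosely following the retail example.
RELS = [
    ("order-1", "CONTAINS", "item-1"),
    ("josh", "BOUGHT", "order-1"),
    ("supplier-1", "SUPPLIES", "item-1"),
    ("order-2", "CONTAINS", "item-2"),
    ("josh", "BOUGHT", "order-2"),
]
KINDS = {
    "item-1": "Item", "item-2": "Item", "order-1": "Order",
    "order-2": "Order", "josh": "Customer", "supplier-1": "Supplier",
}


def should_prune(node, path):
    # Continue only along CONTAINS at depth 1 and BOUGHT at depth 2.
    if len(path) == 1 and path[-1] == "CONTAINS":
        return False
    if len(path) == 2 and path[-1] == "BOUGHT":
        return False
    return True


def should_return(node, path):
    return KINDS[node] == "Customer"


customers = traverse("item-1", RELS, should_prune, should_return)
```

A node can be returned and still expanded, or pruned without being returned; the two decisions are independent.
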
24.
    * Uses mathematical notation approach
    * Complex traversal behaviors, including backtracking
    * https://github.com/tinkerpop/gremlin/wiki

        m = [:]
        g.v(1).out('likes').in('likes').out('likes').groupCount(m)
        m.sort{a,b -> a.value <=> b.value}

25. Cypher
    * "What to find" vs. "How to find"

        $query = 'START item=(1) MATCH (item)<-[:CONTAINS]-(order)<-[:BOUGHT]-(customer) RETURN customer';
        $cypher = new CypherQuery($client, $query);
        $customers = $cypher->getResultSet();

26. Cypher Syntax

        START item = (1)                        START item = (1,2,3)
        START item = (items, 'name:Q32*')       START item = (1), customer = (2,3)
        MATCH (item)<--(order)                  MATCH (order)-->(item)
        MATCH (order)-[r]->(item)               MATCH ()--(item)
        MATCH
            (supplier)-[:SUPPLIES]->(item)<-[:CONTAINS]-(order),
            (customer)-[:RATED]->(item)
        WHERE customer.name = 'Josh' and s.coupon = 'freewidget'
        RETURN item, order                      RETURN customer, item, r.rating
        RETURN r~TYPE                           RETURN COUNT(*)
        ORDER BY customer.name DESC             RETURN AVG(r.rating)
        LIMIT 3 SKIP 2

27. Cypher - All Together Now

        // Find the top 10 `widget` ratings by customers who bought AND rated
        // `widgets`, and the supplier
        START item = (items, 'name:widget')
        MATCH (item)<--(order)<--(customer)-[r:RATED]->(item)<--(supplier)
        RETURN customer, r.rating, supplier ORDER BY r.rating DESC LIMIT 10

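
To show what a pattern like this asks the database to do, here is a hand-written Python equivalent. It is a sketch only: the relationship types are spelled out where the slide leaves some arrows untyped, and the data, names, and `top_ratings` helper are hypothetical.

```python
# Hypothetical (start, type, end, properties) relationships.
RELS = [
    ("order-1", "CONTAINS", "widget", {}),
    ("josh", "BOUGHT", "order-1", {}),
    ("josh", "RATED", "widget", {"rating": 4}),
    ("order-2", "CONTAINS", "widget", {}),
    ("ann", "BOUGHT", "order-2", {}),
    ("ann", "RATED", "widget", {"rating": 5}),
    ("bob", "RATED", "widget", {"rating": 3}),  # rated but never bought
    ("acme", "SUPPLIES", "widget", {}),
]


def top_ratings(rels, item, limit=10):
    """Match item<-order<-customer-[RATED]->item<-supplier, best ratings first."""

    def incoming(node, rel_type):
        return [(s, props) for s, t, e, props in rels if e == node and t == rel_type]

    rows = []
    for order, _ in incoming(item, "CONTAINS"):
        for customer, _ in incoming(order, "BOUGHT"):
            for rater, props in incoming(item, "RATED"):
                if rater != customer:
                    continue  # the customer must have both bought AND rated
                for supplier, _ in incoming(item, "SUPPLIES"):
                    rows.append((customer, props["rating"], supplier))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


results = top_ratings(RELS, "widget")
```

The Cypher version states only the shape; the nested loops here are the "how" that the query engine is free to choose for itself.
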
28. Tools
    * Neoclipse
    * Webadmin
29. Are RDBs Useful At All?
    * Aggregation
    * Ordered data
    * Truly tabular data
    * Few or clearly defined relationships
30. Questions?