Successfully reported this slideshow.
Upcoming SlideShare
×

Handling Billions of Edges in a Graph Database

1,429 views

Published on

By Michael Hackstein (@mchacki)

The complexity and amount of data rises. Modern graph databases are designed to handle the complexity but still not for the amount of data. When hitting a certain size of a graph many dedicated graph databases reach their limits in vertical or most common horizontal scalability. In this talk I´ll provide a brief overview about current approaches and their limits towards scalability. Dealing with complex data in a complex system doesn't make things easier... but more fun finding a solution. Join me on my journey to handle billions of edges in a graph database.

Published in: Technology
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Handling Billions of Edges in a Graph Database

1. 1. www.arangodb.com Handling Billions Of Edges in a Graph Database Michael Hackstein @mchacki New Technology
2. 2. Michael Hackstein ‣ ArangoDB Core Team ‣ Web Frontend ‣ Graph visualisation ‣ Graph features ‣ SmartGraphs ‣ Host of cologne.js ‣ Master’s Degree  (spec. Databases and  Information Systems) 2
3. 3. What are Graph Databases ‣ Schema-free Objects (Vertices) ‣ Relations between them (Edges) ‣ Edges have a direction 3 { name: "alice", age: 32 } { name: "bob", age: 35, size: 1,73m } { name: "ﬁshing" } { name: "reading" } { name: "dancing" } married hobby hobby hobby hobby
4. 4. What are Graph Databases ‣ Schema-free Objects (Vertices) ‣ Relations between them (Edges) ‣ Edges have a direction ‣ Edges can be queried in both directions ‣ Easily query a range of edges (2 to 5) ‣ Undeﬁned number of edges (1 to *) ‣ Shortest Path between two vertices 3 { name: "alice", age: 32 } { name: "bob", age: 35, size: 1,73m } { name: "ﬁshing" } { name: "reading" } { name: "dancing" } married hobby hobby hobby hobby
5. 5. Typical Graph Queries 4
6. 6. Typical Graph Queries ‣ Give me all friends of Alice 4
7. 7. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice 4
8. 8. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob 4
9. 9. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket 4
10. 10. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: 4
11. 11. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: ‣ Give me all users that share two hobbies with Alice 4
12. 12. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: ‣ Give me all users that share two hobbies with Alice ‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought it and the products rating, but only 20 of them. 4
13. 13. Non-Typical Graph Queries 5
14. 14. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. 5
15. 15. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. ‣ Give me the age distribution of all users 5
16. 16. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. ‣ Give me the age distribution of all users ‣ Group all users by their name 5
17. 17. Traversal 6 Iterate down two edges with some ﬁlters
18. 18. Traversal ‣ We ﬁrst pick a start vertex (S) 6 S Iterate down two edges with some ﬁlters
19. 19. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S 6 S A B C Iterate down two edges with some ﬁlters
20. 20. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges 6 S A B C Iterate down two edges with some ﬁlters
21. 21. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) 6 S A B C D E Iterate down two edges with some ﬁlters
22. 22. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges 6 S A B C D E Iterate down two edges with some ﬁlters
23. 23. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E 6 S A B C D E Iterate down two edges with some ﬁlters
24. 24. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) 6 S A B C D E Iterate down two edges with some ﬁlters
25. 25. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) 6 S A B C D E Iterate down two edges with some ﬁlters F
26. 26. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) ‣ We apply ﬁlters on edges 6 S A B C D E Iterate down two edges with some ﬁlters F
27. 27. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) ‣ We apply ﬁlters on edges ‣ The next vertex (F) is in desired depth. Return the path S -> B -> F 6 S A B C D E Iterate down two edges with some ﬁlters F
28. 28. Traversal - Complexity ‣ Once: ‣ Find the start vertex ‣ For every depth: ‣ Find all connected edges ‣ Filter non-matching edges ‣ Find connected vertices ‣ Filter non-matching vertices 7 Depends on indexes: Hash: Edge-Index or Index-Free: Linear in edges: Depends on indexes: Hash: Linear in vertices: Only one pass: O 1 1 n n * 1 n 3n
29. 29. Traversal - Complexity ‣ Linear sounds evil? ‣ NOT linear in All Edges O(E) ‣ Only Linear in relevant Edges n < E ‣ Traversals solely scale with their result size. ‣ They are not effected at all by total amount of data ‣ BUT: Every depth increases the exponent: O(3 * n ) ‣ "7 degrees of separation": 3*n < E < 3*n 8 d 6 7
30. 30. ‣ MULTI-MODEL database ‣ Stores Documents and Graphs ‣ Query language AQL ‣ Document Queries ‣ Graph Queries ‣ Joins ‣ All can be combined in the same statement ‣ ACID support including Multi Collection Transactions 9
31. 31. AQL 10 FOR user IN users RETURN user
32. 32. AQL 11 FOR user IN users FILTER user.name == "alice" RETURN user
33. 33. AQL 12 FOR user IN users FILTER user.name == "alice" FOR product IN OUTBOUND user has_bought RETURN product Alice TV has_bought
34. 34. AQL 13 FOR user IN users FILTER user.name == "alice" FOR recommendation, action, path IN 3 ANY user has_bought FILTER path.vertices[2].age <= user.age + 5 AND path.vertices[2].age >= user.age - 5 FILTER recommendation.price < 25 LIMIT 10 RETURN recommendation Alice TV has_bought Bob Playstation has_boughthas_bought alice.age - 5 <= bob.age && bob.age <= alice.age + 5 playstation.price < 25
35. 35. 14 Demo Time Querying basics
36. 36. First Boost - Vertex Centric Indices ‣ Remember Complexity? O(3 * n ) ‣ Filtering of non-matching edges is linear for every depth ‣ Index all edges based on their vertices and arbitrary other attributes ‣ Find initial set of edges in identical time ‣ Less / No post-ﬁltering required ‣ This decreases the n 15 d
37. 37. 16 Demo Time Vertex-Centric Indices
38. 38. Scaling ‣ Vertex-Centric Indexes help with super-nodes ‣ But what if the graph is too large for one machine? ‣ Distribute graph on several machines (sharding) ‣ How to query it now? ‣ No global view of the graph possible any more ‣ What about edges between servers? 17
39. 39. 18 First let's do the cluster thingy
40. 40. 19
41. 41. 20 Marathon
42. 42. 21 Demo Time DC / OS
43. 43. Is Mesosphere required? ‣ ArangoDB can run clusters without it ‣ Setup Requires manual effort (can be scripted): ‣ Conﬁgure IP addresses ‣ Correct startup ordering ‣ This works: ‣ Automatic Failover (Follower takes over if leader dies) ‣ Rebalancing of shards ‣ Everything inside of ArangoDB ‣ This is based on Mesos: ‣ Complete self healing ‣ Automatic restart of ArangoDBs (on new machines) ➡ We suggest you have someone on call 22
44. 44. 23 Now distribute the graph
45. 45. Dangers of Sharding ‣ Only parts of the graph on every machine ‣ Neighboring vertices may be on different machines ‣ Even edges could be on other machines than their vertices ‣ Queries need to be executed in a distributed way ‣ Result needs to be merged locally 24
46. 46. Random Distribution ‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their vertices ‣ A lot of network overhead is required for querying 25 ‣ Advantages: ‣ every server takes an equal portion of data ‣ easy to realize ‣ no knowledge about data required ‣ always works
47. 47. Random Distribution ‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their vertices ‣ A lot of network overhead is required for querying 25 ‣ Advantages: ‣ every server takes an equal portion of data ‣ easy to realize ‣ no knowledge about data required ‣ always works
48. 48. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
49. 49. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
50. 50. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
51. 51. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this? ????
52. 52. Index-Free Adjacency ‣ ArangoDB uses an hash-based EdgeIndex (O(1) - lookup) ‣ The vertex is independent of it's edges ‣ It can be stored on a different machine 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this? ????
53. 53. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27
54. 54. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27
55. 55. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27 uses Domain Knowledge  for short-cuts
56. 56. 28 Sneak Preview SmartGraphs
57. 57. Benchmark Comparison Source: https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/