Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- Processing large-scale graphs with ... by ArangoDB Database 2173 views
- Handling billions of edges in a gra... by ArangoDB Database 1018 views
- Neo4j and the Panama Papers - FooCa... by Craig Taverner 278 views
- Mongo db improve the performance of... by Juan Antonio Roy ... 535 views
- Extensibility of a database api wit... by ArangoDB Database 1161 views
- Software + Babies by ArangoDB Database 711 views

1,429 views

Published on

The complexity and amount of data rises. Modern graph databases are designed to handle the complexity but still not for the amount of data. When hitting a certain size of a graph many dedicated graph databases reach their limits in vertical or most common horizontal scalability. In this talk I´ll provide a brief overview about current approaches and their limits towards scalability. Dealing with complex data in a complex system doesn't make things easier... but more fun finding a solution. Join me on my journey to handle billions of edges in a graph database.

Published in:
Technology

No Downloads

Total views

1,429

On SlideShare

0

From Embeds

0

Number of Embeds

33

Shares

0

Downloads

14

Comments

0

Likes

1

No embeds

No notes for slide

- 1. www.arangodb.com Handling Billions Of Edges in a Graph Database Michael Hackstein @mchacki New Technology
- 2. Michael Hackstein ‣ ArangoDB Core Team ‣ Web Frontend ‣ Graph visualisation ‣ Graph features ‣ SmartGraphs ‣ Host of cologne.js ‣ Master’s Degree (spec. Databases and Information Systems) 2
- 3. What are Graph Databases ‣ Schema-free Objects (Vertices) ‣ Relations between them (Edges) ‣ Edges have a direction 3 { name: "alice", age: 32 } { name: "bob", age: 35, size: 1,73m } { name: "ﬁshing" } { name: "reading" } { name: "dancing" } married hobby hobby hobby hobby
- 4. What are Graph Databases ‣ Schema-free Objects (Vertices) ‣ Relations between them (Edges) ‣ Edges have a direction ‣ Edges can be queried in both directions ‣ Easily query a range of edges (2 to 5) ‣ Undeﬁned number of edges (1 to *) ‣ Shortest Path between two vertices 3 { name: "alice", age: 32 } { name: "bob", age: 35, size: 1,73m } { name: "ﬁshing" } { name: "reading" } { name: "dancing" } married hobby hobby hobby hobby
- 5. Typical Graph Queries 4
- 6. Typical Graph Queries ‣ Give me all friends of Alice 4
- 7. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice 4
- 8. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob 4
- 9. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket 4
- 10. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: 4
- 11. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: ‣ Give me all users that share two hobbies with Alice 4
- 12. Typical Graph Queries ‣ Give me all friends of Alice ‣ Give me all friends-of-friends of Alice ‣ What is the linking path between Alice and Bob ‣ Which Trainstations can I reach if I am allowed to drive a distance of 6 stations on my ticket ‣ Pattern Matching: ‣ Give me all users that share two hobbies with Alice ‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought it and the products rating, but only 20 of them. 4
- 13. Non-Typical Graph Queries 5
- 14. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. 5
- 15. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. ‣ Give me the age distribution of all users 5
- 16. Non-Typical Graph Queries ‣ Give me all users which have an age attribute between 21 and 35. ‣ Give me the age distribution of all users ‣ Group all users by their name 5
- 17. Traversal 6 Iterate down two edges with some ﬁlters
- 18. Traversal ‣ We ﬁrst pick a start vertex (S) 6 S Iterate down two edges with some ﬁlters
- 19. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S 6 S A B C Iterate down two edges with some ﬁlters
- 20. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges 6 S A B C Iterate down two edges with some ﬁlters
- 21. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) 6 S A B C D E Iterate down two edges with some ﬁlters
- 22. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges 6 S A B C D E Iterate down two edges with some ﬁlters
- 23. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E 6 S A B C D E Iterate down two edges with some ﬁlters
- 24. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) 6 S A B C D E Iterate down two edges with some ﬁlters
- 25. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) 6 S A B C D E Iterate down two edges with some ﬁlters F
- 26. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) ‣ We apply ﬁlters on edges 6 S A B C D E Iterate down two edges with some ﬁlters F
- 27. Traversal ‣ We ﬁrst pick a start vertex (S) ‣ We collect all edges on S ‣ We apply ﬁlters on edges ‣ We iterate down one of the new vertices (A) ‣ We apply ﬁlters on edges ‣ The next vertex (E) is in desired depth. Return the path S -> A -> E ‣ Go back to the next unﬁnished vertex (B) ‣ We iterate down on (B) ‣ We apply ﬁlters on edges ‣ The next vertex (F) is in desired depth. Return the path S -> B -> F 6 S A B C D E Iterate down two edges with some ﬁlters F
- 28. Traversal - Complexity ‣ Once: ‣ Find the start vertex ‣ For every depth: ‣ Find all connected edges ‣ Filter non-matching edges ‣ Find connected vertices ‣ Filter non-matching vertices 7 Depends on indexes: Hash: Edge-Index or Index-Free: Linear in edges: Depends on indexes: Hash: Linear in vertices: Only one pass: O 1 1 n n * 1 n 3n
- 29. Traversal - Complexity ‣ Linear sounds evil? ‣ NOT linear in All Edges O(E) ‣ Only Linear in relevant Edges n < E ‣ Traversals solely scale with their result size. ‣ They are not effected at all by total amount of data ‣ BUT: Every depth increases the exponent: O(3 * n ) ‣ "7 degrees of separation": 3*n < E < 3*n 8 d 6 7
- 30. ‣ MULTI-MODEL database ‣ Stores Documents and Graphs ‣ Query language AQL ‣ Document Queries ‣ Graph Queries ‣ Joins ‣ All can be combined in the same statement ‣ ACID support including Multi Collection Transactions 9
- 31. AQL 10 FOR user IN users RETURN user
- 32. AQL 11 FOR user IN users FILTER user.name == "alice" RETURN user
- 33. AQL 12 FOR user IN users FILTER user.name == "alice" FOR product IN OUTBOUND user has_bought RETURN product Alice TV has_bought
- 34. AQL 13 FOR user IN users FILTER user.name == "alice" FOR recommendation, action, path IN 3 ANY user has_bought FILTER path.vertices[2].age <= user.age + 5 AND path.vertices[2].age >= user.age - 5 FILTER recommendation.price < 25 LIMIT 10 RETURN recommendation Alice TV has_bought Bob Playstation has_boughthas_bought alice.age - 5 <= bob.age && bob.age <= alice.age + 5 playstation.price < 25
- 35. 14 Demo Time Querying basics
- 36. First Boost - Vertex Centric Indices ‣ Remember Complexity? O(3 * n ) ‣ Filtering of non-matching edges is linear for every depth ‣ Index all edges based on their vertices and arbitrary other attributes ‣ Find initial set of edges in identical time ‣ Less / No post-ﬁltering required ‣ This decreases the n 15 d
- 37. 16 Demo Time Vertex-Centric Indices
- 38. Scaling ‣ Vertex-Centric Indexes help with super-nodes ‣ But what if the graph is too large for one machine? ‣ Distribute graph on several machines (sharding) ‣ How to query it now? ‣ No global view of the graph possible any more ‣ What about edges between servers? 17
- 39. 18 First let's do the cluster thingy
- 40. 19
- 41. 20 Marathon
- 42. 21 Demo Time DC / OS
- 43. Is Mesosphere required? ‣ ArangoDB can run clusters without it ‣ Setup Requires manual effort (can be scripted): ‣ Conﬁgure IP addresses ‣ Correct startup ordering ‣ This works: ‣ Automatic Failover (Follower takes over if leader dies) ‣ Rebalancing of shards ‣ Everything inside of ArangoDB ‣ This is based on Mesos: ‣ Complete self healing ‣ Automatic restart of ArangoDBs (on new machines) ➡ We suggest you have someone on call 22
- 44. 23 Now distribute the graph
- 45. Dangers of Sharding ‣ Only parts of the graph on every machine ‣ Neighboring vertices may be on different machines ‣ Even edges could be on other machines than their vertices ‣ Queries need to be executed in a distributed way ‣ Result needs to be merged locally 24
- 46. Random Distribution ‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their vertices ‣ A lot of network overhead is required for querying 25 ‣ Advantages: ‣ every server takes an equal portion of data ‣ easy to realize ‣ no knowledge about data required ‣ always works
- 47. Random Distribution ‣ Disadvantages: ‣ Neighbors on different machines ‣ Probably edges on other machines than their vertices ‣ A lot of network overhead is required for querying 25 ‣ Advantages: ‣ every server takes an equal portion of data ‣ easy to realize ‣ no knowledge about data required ‣ always works
- 48. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
- 49. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
- 50. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this?
- 51. Index-Free Adjacency 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this? ????
- 52. Index-Free Adjacency ‣ ArangoDB uses an hash-based EdgeIndex (O(1) - lookup) ‣ The vertex is independent of it's edges ‣ It can be stored on a different machine 26 ‣ Used by most other graph databases ‣ Every vertex maintains two lists of it's edges (IN and OUT) ‣ Do not use an index to ﬁnd edges ‣ How to shard this? ????
- 53. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27
- 54. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27
- 55. Domain Based Distribution ‣ Many Graphs have a natural distribution ‣ By country/region for People ‣ By tags for Blogs ‣ By category for Products ‣ Most edges in same group ‣ Rare edges between groups 27 uses Domain Knowledge for short-cuts
- 56. 28 Sneak Preview SmartGraphs
- 57. Benchmark Comparison Source: https://www.arangodb.com/2015/10/benchmark-postgresql-mongodb-arangodb/
- 58. Thank you ‣ Further questions? ‣ Follow us on twitter: @arangodb ‣ Join our slack: slack.arangodb.com ‣ Follow me on twitter/github: @mchacki ‣ Write me a mail: michael@arangodb.com 30

No public clipboards found for this slide

×
### Save the most important slides with Clipping

Clipping is a handy way to collect and organize the most important slides from a presentation. You can keep your great finds in clipboards organized around topics.

Be the first to comment