https://www.bigdataspain.org/2017/talk/scaling-to-billions-of-edges-in-a-graph-database
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
By Michael Hackstein (@mchacki)
The complexity and the amount of data keep rising. Modern graph databases are designed to handle the complexity, but still not the amount of data. When a graph reaches a certain size, many dedicated graph databases hit their limits in vertical or, most commonly, horizontal scalability. In this talk I'll provide a brief overview of current approaches and their limits with respect to scalability. Dealing with complex data in a complex system doesn't make things easier... but it is more fun finding a solution. Join me on my journey to handle billions of edges in a graph database.
3. ‣ Michael Hackstein
‣ ArangoDB Core Team
‣ Graph visualisation
‣ Graph features
‣ SmartGraphs
‣ Host of cologne.js
‣ Master's Degree (spec. Databases and Information Systems)
4. Example: schema-free vertex documents connected by "hobby" edges
{ name: "alice", age: 32 }
{ name: "bob", age: 35, size: 1.73m }
{ name: "dancing" }
{ name: "reading" }
{ name: "fishing" }
(four "hobby" edges link the person vertices to their hobby vertices)
‣ Schema-free Objects (Vertices)
‣ Relations between them (Edges)
‣ Edges have a direction
‣ Edges can be queried in both directions
‣ Easily query a range of edges (2 to 5)
‣ Undefined number of edges (1 to *)
‣ Shortest Path between two vertices
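The properties above can be sketched with a minimal directed graph in plain Python. This is a toy stand-in for a graph database, not ArangoDB code; names like `TinyGraph` and `add_edge` are made up for illustration:

```python
from collections import defaultdict, deque

class TinyGraph:
    """Directed graph: schema-free dict vertices, edges kept in both directions."""
    def __init__(self):
        self.vertices = {}                  # key -> arbitrary document
        self.out_edges = defaultdict(list)  # key -> outgoing neighbor keys
        self.in_edges = defaultdict(list)   # key -> incoming neighbor keys

    def add_vertex(self, key, doc):
        self.vertices[key] = doc

    def add_edge(self, src, dst):
        self.out_edges[src].append(dst)
        self.in_edges[dst].append(src)

    def shortest_path(self, start, goal):
        """BFS over outgoing edges; returns a list of vertex keys, or None."""
        parents, queue = {start: None}, deque([start])
        while queue:
            v = queue.popleft()
            if v == goal:
                path = []
                while v is not None:
                    path.append(v)
                    v = parents[v]
                return path[::-1]
            for w in self.out_edges[v]:
                if w not in parents:
                    parents[w] = v
                    queue.append(w)
        return None

g = TinyGraph()
g.add_vertex("alice", {"name": "alice", "age": 32})
g.add_vertex("dancing", {"name": "dancing"})
g.add_edge("alice", "dancing")
print(g.shortest_path("alice", "dancing"))  # ['alice', 'dancing']
```

Because the edge is stored in both an OUT and an IN list, it can be followed against its direction as well, which is what "edges can be queried in both directions" relies on.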
10. ‣ Give me all products that at least one of my friends has bought together with the products I already own, ordered by how many friends have bought it and the product's rating, but only 20 of them.
(Diagram: Alice -is_friend-> Friend -has_bought-> Product; Alice herself -has_bought-> Products)
11. ‣ Give me all users that have an age attribute between 21 and 35
‣ Give me the age distribution of all users
‣ Group all users by their name
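These document queries map directly onto plain filtering and grouping. A minimal Python sketch over a made-up list of user documents:

```python
from collections import Counter

users = [
    {"name": "alice", "age": 32},
    {"name": "bob", "age": 35},
    {"name": "bob", "age": 22},
    {"name": "carol", "age": 19},
]

# All users with an age attribute between 21 and 35
in_range = [u for u in users if 21 <= u["age"] <= 35]

# Age distribution of all users
distribution = Counter(u["age"] for u in users)

# Group all users by their name
by_name = {}
for u in users:
    by_name.setdefault(u["name"], []).append(u)

print(len(in_range), distribution[32], len(by_name["bob"]))  # 3 1 2
```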
13.-21. Traversal: Iterate down two edges with some filters
We first pick a start vertex (S)
We collect all edges on S
We apply filters on the edges
We iterate down one of the edges to (A)
We collect its edges and apply filters on them
The next vertex (E) is in the desired depth:
Return the path S -> A -> E
Go back to the next unfinished vertex (B)
We iterate down on (B)
We collect its edges and apply filters on them
The next vertex (F) is in the desired depth:
Return the path S -> B -> F
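The walkthrough above is a depth-first traversal with an edge filter. A compact sketch of that procedure (the graph and the `traverse` helper are illustrative, not a database API):

```python
def traverse(out_edges, start, depth, edge_ok):
    """Depth-first traversal: return all paths with exactly `depth` edges
    from `start`, following only edges accepted by `edge_ok(src, dst)`."""
    paths = []

    def walk(path):
        v = path[-1]
        if len(path) - 1 == depth:      # desired depth reached
            paths.append(path)
            return
        for w in out_edges.get(v, []):  # collect all edges on v
            if edge_ok(v, w):           # filter non-matching edges
                walk(path + [w])        # iterate down this edge

    walk([start])
    return paths

# The example from the slides: S -> A -> E and S -> B -> F at depth 2.
out_edges = {"S": ["A", "B"], "A": ["E"], "B": ["F"]}
print(traverse(out_edges, "S", 2, lambda s, d: True))
# [['S', 'A', 'E'], ['S', 'B', 'F']]
```

Backtracking to "the next unfinished vertex" happens implicitly when the recursion returns from A and continues with B.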
22. Traversal: Complexity
                  Operation                     Comment                    O
Once:             Find the start vertex         Depends on indexes         Hash: 1
For every depth:  Find all connected edges      Edge-Index or Index-Free   1
                  Filter non-matching edges     Linear in edges            n
                  Find connected vertices       Depends on indexes         Hash: n · 1
                  Filter non-matching vertices  Linear in vertices         n
Total for one pass: 3n
25.-28. Traversal: Complexity
Linear sounds evil?
NOT linear in all edges: O(E)
Only linear in the relevant edges: n < E
Traversals solely scale with their result size
They are not affected at all by the total amount of data
BUT: every depth increases the exponent: O((3n)^d)
"7 degrees of separation": n^6 < E < n^7
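A quick back-of-the-envelope check of O((3n)^d): even a modest number of relevant edges per vertex explodes with depth, which is why traversal cost is dominated by depth rather than by total graph size. The numbers below are purely illustrative:

```python
# Cost model from the slides: one pass over a vertex costs about 3n
# operations (find edges, filter edges, find/filter vertices), and every
# extra depth multiplies the frontier, giving O((3n)**d).
def traversal_cost(n, d):
    return (3 * n) ** d

for d in (1, 2, 3, 4):
    print(d, traversal_cost(10, d))
# depth 1: 30, depth 2: 900, depth 3: 27000, depth 4: 810000
```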
29. ‣ MULTI-MODEL database
‣ Stores Key Value, Documents, and Graphs
‣ All in one core
‣ Query language AQL
‣ Document Queries
‣ Graph Queries
‣ Joins
‣ All can be combined in the same statement
‣ ACID support including Multi Collection Transactions
31. FOR user IN users
FILTER user.name == "alice"
RETURN user
(result: the Alice vertex)
32. FOR user IN users
FILTER user.name == "alice"
FOR product IN OUTBOUND user has_bought
RETURN product
(result: the products Alice has bought, e.g. Alice -has_bought-> TV)
33. FOR user IN users
FILTER user.name == "alice"
FOR recommendation, action, path IN 3 ANY user has_bought
FILTER path.vertices[2].age <= user.age + 5
AND path.vertices[2].age >= user.age - 5
FILTER recommendation.price < 25
LIMIT 10
RETURN recommendation
(diagram: Alice -has_bought-> TV <-has_bought- Bob -has_bought-> Playstation,
where alice.age - 5 <= bob.age <= alice.age + 5 and playstation.price < 25)
34. ‣ Many graphs have "celebrities"
‣ Vertices with many inbound and/or outbound edges
‣ Traversing over them is expensive (linear in the number of edges)
‣ Often you only need a subset of edges
35. ‣ Remember the complexity? O((3n)^d)
‣ Filtering of non-matching edges is linear for every depth
‣ Index all edges based on their vertices and arbitrary other attributes
‣ Find the initial set of edges in identical time
‣ Less / no post-filtering required
‣ This decreases n significantly
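A vertex-centric index can be sketched as a hash map keyed by (vertex, attribute), so that only the relevant subset of edges is fetched instead of scanning all edges of a super-node. This is a simplified illustration, not ArangoDB's actual implementation:

```python
from collections import defaultdict

# Edges of one "celebrity" vertex, each labeled with an attribute.
edges = [
    ("alice", "tv", "has_bought"),
    ("alice", "bob", "is_friend"),
    ("alice", "playstation", "has_bought"),
    ("alice", "carol", "is_friend"),
]

# Vertex-centric index: (source vertex, edge label) -> target vertices.
vc_index = defaultdict(list)
for src, dst, label in edges:
    vc_index[(src, label)].append(dst)

# One hash lookup returns only the matching edges; no post-filtering
# over all of alice's edges is needed.
print(vc_index[("alice", "is_friend")])  # ['bob', 'carol']
```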
36. ‣ We have the rise of big data
‣ Store everything you can
‣ Dataset easily grows beyond one machine
‣ This includes graph data!
38.-40. Scaling horizontally
Distribute the graph on several machines (sharding)
How to query it now?
No global view of the graph is possible any more
What about edges between servers?
In a sharded environment the network is most of the time the bottleneck
Reduce network hops
Vertex-centric indexes again help with super-nodes
But: only on a local machine
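Why network hops dominate can be sketched by hash-placing vertices across servers and counting the edges whose endpoints land on different machines. The hash and the tiny graph below are made up for illustration:

```python
def shard(vertex, num_servers):
    # Random-like placement via a stable, toy hash of the vertex key.
    return sum(vertex.encode()) % num_servers

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
num_servers = 3

# Every edge whose endpoints live on different servers costs a network hop.
remote = sum(1 for s, d in edges
             if shard(s, num_servers) != shard(d, num_servers))
print(f"{remote} of {len(edges)} edges cross a server boundary")
# -> 5 of 5 edges cross a server boundary
```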
41. Random distribution
Advantages:
‣ every server takes an equal portion of the data
‣ easy to realize
‣ no knowledge about the data required
‣ always works
Disadvantages:
‣ neighbors end up on different machines
‣ edges probably live on other machines than their vertices
‣ a lot of network overhead is required for querying
43. Domain-based Distribution
Many graphs have a natural distribution:
‣ by country/region for people
‣ by tags for blogs
‣ by category for products
Most edges stay within the same group
Only rare edges between groups
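The effect of domain-based distribution can be sketched by placing vertices by their group instead of by hash and recounting the cross-server edges. The group assignments below are made up:

```python
# Vertices grouped by a natural domain, e.g. region.
group_of = {"alice": "EU", "bob": "EU", "carol": "EU",
            "dave": "US", "erin": "US"}
servers = {"EU": 0, "US": 1}

# Most edges stay inside a group; only rare edges cross it.
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol"),
         ("dave", "erin"), ("carol", "dave")]

remote = sum(1 for s, d in edges
             if servers[group_of[s]] != servers[group_of[d]])
print(f"{remote} of {len(edges)} edges cross a server boundary")  # 1 of 5
```

Compared with the random placement above, only the one edge between groups (carol to dave) needs a network hop.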
49. ‣ ArangoDB uses a hash-based edge index (O(1) lookup)
‣ The vertex is independent of its edges
‣ It can be stored on a different machine
‣ Most other graph databases use index-free adjacency instead:
‣ Every vertex maintains two lists of its edges (IN and OUT)
‣ They do not use an index to find edges
‣ How to shard this?
????
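The two approaches can be sketched side by side: a hash-based edge index keyed by vertex, where edge data can live apart from the vertex document, versus index-free adjacency, where each vertex carries its own IN/OUT lists and is therefore welded to its edges. A toy illustration, not either system's real storage layout:

```python
from collections import defaultdict

edges = [("alice", "tv"), ("alice", "bob"), ("bob", "playstation")]

# Hash-based edge index (ArangoDB-style): vertex key -> edges,
# independent of where the vertex document itself is stored.
edge_index_out = defaultdict(list)
edge_index_in = defaultdict(list)
for src, dst in edges:
    edge_index_out[src].append((src, dst))
    edge_index_in[dst].append((src, dst))

# Index-free adjacency: each vertex object owns its IN/OUT lists,
# so a vertex and its edges are hard to shard separately.
vertices = defaultdict(lambda: {"OUT": [], "IN": []})
for src, dst in edges:
    vertices[src]["OUT"].append(dst)
    vertices[dst]["IN"].append(src)

print(edge_index_out["alice"])  # [('alice', 'tv'), ('alice', 'bob')]
print(vertices["bob"])          # {'OUT': ['playstation'], 'IN': ['alice']}
```

With the index variant, moving the "alice" document to another machine leaves the edge lookups untouched; with index-free adjacency the edge lists move with the vertex, which is exactly the sharding question the slide raises.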
50. ‣ Further questions?
‣ Follow us on twitter: @arangodb
‣ Join our slack: slack.arangodb.com
‣ Follow me on twitter/github: @mchacki
Thank You