Challenges in the Design of a Graph Database Benchmark

Graph databases are one of the leading drivers in the emerging, highly heterogeneous landscape of database management systems for non-relational data management and processing. The recent interest in and success of graph databases arise mainly from the growing interest in social media analysis and the exploration and mining of relationships in social media data. However, with a graph-based model as a very flexible underlying data model, a graph database can serve a large variety of scenarios from different domains such as travel planning, supply chain management and package routing.
During the past months, many vendors have designed and implemented solutions to satisfy the need to efficiently store, manage and query graph data. However, the solutions are very diverse in terms of the supported graph data model, supported query languages, and APIs. With a growing number of vendors offering graph processing and graph management functionality, there is also an increased need to compare the solutions on a functional level as well as on a performance level with the help of benchmarks.
Graph database benchmarking is a challenging task. Existing graph database benchmarks are limited in their functionality and in their portability to different graph-based data models and different application domains. Existing benchmarks and the supported workloads are typically based on a proprietary query language and on a specific graph-based data model derived from the mathematical notion of a graph. The variety and lack of standardization with respect to the logical representation of graph data and the retrieval of graph data make it hard to define a portable graph database benchmark.
In this talk, we present a proposal and design guideline for a graph database benchmark. Typically, a database benchmark consists of a synthetically generated data set of varying size and varying characteristics and a workload driver. In order to generate graph data sets, we present parameters from graph theory that influence the characteristics of the generated graph data set. The workload driver then issues a set of queries against a well-defined interface of the graph database and gathers relevant performance numbers. We propose a set of performance measures to determine the response time behavior on different workloads and also initial suggestions for typical workloads in graph data scenarios. Our main objective in this session is to open the discussion on graph database benchmarking.
We believe that there is a need for a common understanding of different workloads for graph processing from different domains and the definition of a common subset of core graph functionality in order to provide a general-purpose graph database benchmark. We encourage vendors to participate and to contribute with their domain-dependent knowledge and to define a graph database benchmark proposal.

Published in: Technology

Transcript

  • 1. © Prof. Dr.-Ing. Wolfgang Lehner | Challenges in the Design of a Graph Database Benchmark FOSDEM‘12 – Graph Processing DevRoom Marcus Paradies
  • 2. Outline
      • Motivation
      • Challenges
      • Thoughts on Graph Data Generation
      • Thoughts on Query Workload
      • Summary and Outlook
      • Discussion
  • 3. Motivation
      • Graph databases are gaining momentum
      • Enterprise corporations are getting interested
      • How to compare the available graph database vendors?
      • Main issue: Results from benchmarks are not comparable
      • Lack of standardization in the data model and query language
      • What are “typical” graph operations?
  • 4. Challenges
  • 5. Challenge #1: Application Domain
      • Graph data is not homogeneous
      • Graph data from different domains follows different patterns
      • Examples:
          • Social Network Analysis (SNA)
          • Protein Interaction Analysis
          • Recommendation Systems
          • Supply Chain Management (Vehicle Routing, CRM)
          • Fraud Detection in Financial Systems
          • …
      • Challenge: Find an application domain which represents a graph data pattern common in many different scenarios.
  • 6. Challenge #2: Graph Data Model
      • What flavours of graph data models are commonly used?
  • 14. Challenge #2: Graph Data Model
      • Directed Graph
      • Undirected Graph
      • Mixed Graph
      • Multi Graph
      • (Plain) Property Graph
      • (Structured Property Graph)
      • Hyper Graph
      • Challenge: Find a graph data model suited for the majority of use cases from various domains.
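
To make the "(Plain) Property Graph" flavour concrete, here is a minimal Python sketch. It assumes the common reading of the term: a directed multigraph whose nodes and edges carry key/value properties. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `neighbors`) are hypothetical and not taken from the talk or from any product.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    edge_id: int
    source: int
    target: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Illustrative directed multigraph with properties on nodes and edges."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.edges: Dict[int, Edge] = {}
        self.out_edges: Dict[int, List[int]] = {}  # node id -> outgoing edge ids

    def add_node(self, node_id: int, **properties: Any) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        self.out_edges.setdefault(node_id, [])
        return node

    def add_edge(self, source: int, target: int, label: str, **properties: Any) -> Edge:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        # Parallel edges are allowed, so every edge gets its own id.
        edge = Edge(len(self.edges), source, target, label, dict(properties))
        self.edges[edge.edge_id] = edge
        self.out_edges[source].append(edge.edge_id)
        return edge

    def neighbors(self, node_id: int, label: Optional[str] = None) -> List[int]:
        """Targets of outgoing edges, optionally restricted to one edge label."""
        return [
            self.edges[e].target
            for e in self.out_edges[node_id]
            if label is None or self.edges[e].label == label
        ]
```

Dropping the property dictionaries gives a plain multigraph, and storing each edge in both directions would emulate an undirected graph, which is one reason a simple property graph is a candidate for a common model.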
  • 16. Challenge #3: Querying Graph Data
      • Large variety in graph processing and manipulation languages
      • Each graph database vendor implements own query languages/APIs
      • Reason: No standardized graph query language available
      • Challenge: Find a way to abstract from the zoo of available query languages.
  • 17. Challenge #4: Defining the Workload
      • The workload to be defined depends on the underlying query/manipulation language
      • Should complex (algorithmic) operations be part of a database benchmark?
      • Which algorithms to pick?
          • Social Network Analysis → Find communities
          • Supply Chain Management → Find maximal flow
          • Web of Data → Find pattern matches
      • How are concurrent users represented?
      • What about transactionality?
  • 18. Thoughts on Graph Data Generation
  • 19. Graph Data Generation - Patterns
      • Understanding graph patterns (characteristics) is crucial for a good graph data generator
      • What are distinguishing characteristics of graphs?
      • How can we identify graph patterns on large graphs?
      • Three main patterns [1]:
          • Power law distributed
          • Small diameters
          • Community Effects
  • 20. Pattern 1 – Power law distributed
      • Most real-world graph data sets follow a power law distribution
      • Examples:
          • Internet router graph
          • Subsets of the WWW
          • Citation Graphs
      • source: [2]
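
One way a data generator could produce a skewed, power-law-like degree distribution is preferential attachment, where new nodes prefer to link to nodes that already have many edges. The talk does not name a generator, so the following Python sketch is an assumption for illustration only; the function names and parameters are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment_graph(num_nodes, edges_per_node, seed=0):
    """Return an edge list grown by preferential attachment.

    Each new node links to `edges_per_node` distinct existing nodes, chosen
    with probability proportional to their current degree.
    """
    if not 1 <= edges_per_node < num_nodes:
        raise ValueError("need 1 <= edges_per_node < num_nodes")
    rng = random.Random(seed)
    m = edges_per_node
    edges = []
    endpoints = []            # every node appears once per incident edge
    targets = list(range(m))  # the first new node links to the m seed nodes
    for new_node in range(m, num_nodes):
        edges.extend((new_node, t) for t in targets)
        endpoints.extend(targets)
        endpoints.extend([new_node] * m)
        # Sampling from `endpoints` is sampling proportional to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())
```

Plotting `degree_histogram(...)` on log-log axes is a quick visual check of whether a generated data set shows the heavy tail seen in real-world graphs.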
  • 21. Pattern 2 – Small Diameters
      • Effective Diameter (eccentricity): Minimum number of hops, in which a fraction (e.g. 90%) of all connected pairs of nodes can reach each other
      • Other measures exist as well, but are not applicable to disconnected graphs
      • In most use cases, diameter is much smaller than the size of the graph
      • Examples:
          • 97% eccentricity of around 16 for path lengths in the WWW
          • Average path length around 6 for Epinions social network
      • source: [1]
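
The effective diameter defined on this slide can be computed directly on small graphs with one breadth-first search per node. The sketch below is a brute-force illustration, not a method from the talk. It assumes an unweighted graph given as a dictionary that maps every node to a list of its neighbors; the function name is hypothetical.

```python
from collections import Counter, deque


def effective_diameter(adjacency, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of all connected
    (ordered) node pairs are within h hops of each other.

    `adjacency` must contain every node as a key. Pairs that cannot reach
    each other are ignored, so disconnected graphs are handled.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    hop_counts = Counter()  # hop distance -> number of connected pairs
    for start in adjacency:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    hop_counts[dist[v]] += 1
                    queue.append(v)
    total_pairs = sum(hop_counts.values())
    if total_pairs == 0:
        return 0
    covered = 0
    for hops in sorted(hop_counts):
        covered += hop_counts[hops]
        if covered >= fraction * total_pairs:
            return hops
    return max(hop_counts)
```

This runs one full search per node, so it is only practical for small graphs; on large data sets the value would have to be estimated, for example from a sample of start nodes.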
  • 22. Pattern 3 – Community Effects
      • Community: A set of nodes, where each node in the set is closer to all other nodes in the community than to nodes outside the community.
      • Communities can be found in many real-world graphs, especially social networks and collaboration networks
      • Clustering Coefficient C: A measure that quantifies the “clumpiness” of a graph
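
The slide does not give a formula for the clustering coefficient. A widely used definition is the local clustering coefficient of a node, i.e. the fraction of pairs of its neighbors that are themselves connected, averaged over all nodes. The Python sketch below assumes that definition and an undirected graph stored as a dictionary of neighbor sets; the function names are hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adjacency[node] - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs exist, so the value is defined as 0 here
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local clustering coefficients over all nodes."""
    if not adjacency:
        return 0.0
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)
```

A triangle yields 1.0 and a star yields 0.0, which matches the intuition that the measure captures how strongly neighborhoods close into communities.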
  • 23. Thoughts on Query Workload
  • 24. Query Workload - Operations
      • Graph Manipulation Operations
          • Add/Update/Remove Nodes from the Graph
          • Add/Update/Remove Edges from the Graph
          • Add/Update/Remove Edge attributes
          • Add/Update/Remove Node attributes
      • Graph Query Operations
          • Retrieve selection of nodes from given filter expression
          • Getting the neighbors of a set of nodes (possibly with edge filter constraints)
      • Graph Traversals
          • Based on basic query operations
          • Exploration of neighborhood from a given set of start nodes
          • Terminated by the number of steps and/or edge/node filter constraints
      • Graph Analytical Operations
          • Aggregation operations such as sum, avg, min, max
          • Aggregations on node-level and on edge-level
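
The traversal operation described above (start nodes, a step limit, optional edge and node filters) can be written independently of any vendor query language. The following Python sketch is one possible reading, using breadth-first exploration over an in-memory adjacency dictionary; the function name, its parameters and the data layout are assumptions for illustration.

```python
from collections import deque


def traverse(out_edges, start_nodes, max_steps, edge_filter=None, node_filter=None):
    """Explore the neighborhood of `start_nodes` breadth-first.

    `out_edges` maps a node to a list of (target, edge_properties) pairs.
    Exploration stops after `max_steps` hops and skips edges or nodes that
    are rejected by the optional filter callables. Returns a dictionary
    mapping each reached node to its hop distance from the start set.
    """
    reached = {node: 0 for node in start_nodes}
    queue = deque(reached)
    while queue:
        node = queue.popleft()
        depth = reached[node]
        if depth == max_steps:
            continue  # step limit reached on this branch
        for target, properties in out_edges.get(node, []):
            if target in reached:
                continue
            if edge_filter is not None and not edge_filter(properties):
                continue
            if node_filter is not None and not node_filter(target):
                continue
            reached[target] = depth + 1
            queue.append(target)
    return reached
```

With `max_steps=1` the same function covers the "neighbors of a set of nodes" query, which shows how traversals build on the basic query operations.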
  • 25. Query Workload - Measures
      • Closely related to benchmark capabilities
      • Measures from relational benchmarks apply, such as
          • Average query response time
          • Transactions per second (throughput)
      • Additional measures for graph traversals
          • Traversals per second
      • What about distributed scenarios?
      • What about concurrent users?
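
A workload driver that reports the two relational-style measures named above could look roughly like the Python sketch below. It is a single-threaded illustration under simplifying assumptions: each operation is a zero-argument callable, and concurrent users and distributed setups (the open questions on this slide) are not modelled. The function name and the keys of the result dictionary are hypothetical.

```python
import time
from statistics import mean


def run_workload(operations, repetitions=1, clock=time.perf_counter):
    """Execute each operation `repetitions` times and collect basic measures.

    Returns the number of executed operations, the average response time in
    seconds, and the throughput in operations per second.
    """
    latencies = []
    started = clock()
    for _ in range(repetitions):
        for operation in operations:
            t0 = clock()
            operation()
            latencies.append(clock() - t0)
    elapsed = clock() - started
    count = len(latencies)
    return {
        "operations": count,
        "avg_response_time_s": mean(latencies) if latencies else 0.0,
        "throughput_ops_per_s": count / elapsed if elapsed > 0 else float("inf"),
    }
```

If every operation is a traversal, the throughput figure corresponds to the "traversals per second" measure. The `clock` parameter exists only so the driver can be tested with a deterministic time source.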
  • 26. Summary and Outlook
      • Graph data distribution is highly important for a graph database benchmark
      • Application domains do have very specific graph characteristics
      • A graph database benchmark has to provide abstract and high-level graph operation descriptions
      • Feel free to contact me if you want to contribute: marcus.paradies@gmail.com
  • 27. Discussion
  • 28. Theses
      • A benchmark based on social network data is nice, but might not be that representative for large enterprise applications
      • Algorithms should NOT be part of a graph database benchmark
      • Only support basic operations such as simple lookups and path traversals
      • The underlying graph data model should be a simple property graph
      • A graph database has to scale in terms of data size as well as number of concurrent users
      • ....
  • 29. References
      • [1] Graph Mining: Laws, Generators, and Algorithms (2006)
      • [2] http://konect.uni-koblenz.de/
      • [3] A Discussion on the Design of Graph Database Benchmarks (2010)
