- 1. Marcus Paradies: Challenges in the Design of a Graph Database Benchmark. FOSDEM '12 – Graph Processing DevRoom. © Prof. Dr.-Ing. Wolfgang Lehner
- 2. > Outline
  - Motivation
  - Challenges
  - Thoughts on Graph Data Generation
  - Thoughts on Query Workload
  - Summary and Outlook
  - Discussion

  Marcus Paradies | FOSDEM 2012 | 1
- 3. > Motivation
  - Graph databases are gaining momentum
  - Enterprise corporations are getting interested
  - How to compare the available graph database vendors?
  - Main issue: results from benchmarks are not comparable
  - Lack of standardization in the data model and query language
  - What are "typical" graph operations?
- 4. > Challenges
- 5. > Challenge #1: Application Domain
  - Graph data is not homogeneous
  - Graph data from different domains follows different patterns
  - Examples:
    - Social Network Analysis (SNA)
    - Protein Interaction Analysis
    - Recommendation Systems
    - Supply Chain Management (Vehicle Routing, CRM)
    - Fraud Detection in Financial Systems
    - …
  - Challenge: Find an application domain which represents a graph data pattern common in many different scenarios.
- 6. > Challenge #2: Graph Data Model
  - What flavours of graph data models are commonly used?
- 14. > Challenge #2: Graph Data Model
  - Directed Graph
  - Undirected Graph
  - Mixed Graph
  - Multi Graph
  - (Plain) Property Graph
  - (Structured) Property Graph
  - Hyper Graph
  - Challenge: Find a graph data model suited for the majority of use cases from various domains.
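To make the "(plain) property graph" flavour concrete, here is a minimal Python sketch of one possible in-memory representation: a directed multigraph whose nodes and edges carry key/value properties. The class and method names (`PropertyGraph`, `add_node`, `add_edge`, `out_neighbors`) are hypothetical and chosen for illustration only; they are not taken from the talk or from any graph database API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Hypothetical directed multigraph with properties on nodes and edges."""
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: int, **properties: Any) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def add_edge(self, source: int, target: int, label: str, **properties: Any) -> Edge:
        # Both endpoints must already exist; parallel edges are allowed,
        # which is what makes this a multigraph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge = Edge(source, target, label, dict(properties))
        self.edges.append(edge)
        return edge

    def out_neighbors(self, node_id: int, label: Optional[str] = None) -> List[int]:
        # Linear scan over all edges: simple, but not tuned for large graphs.
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (label is None or e.label == label)
        ]


graph = PropertyGraph()
graph.add_node(1, name="alice")
graph.add_node(2, name="bob")
graph.add_edge(1, 2, "knows", since=2010)
graph.add_edge(1, 2, "works_with")
print(graph.out_neighbors(1, label="knows"))
```

An undirected or mixed graph could be layered on top by storing a per-edge direction flag; a hypergraph would need edges that reference more than two nodes, which this sketch does not cover.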
- 16. > Challenge #3: Querying Graph Data
  - Large variety in graph processing and manipulation languages
  - Each graph database vendor implements its own query languages/APIs
  - Reason: No standardized graph query language available
  - Challenge: Find a way to abstract from the zoo of available query languages.
- 17. > Challenge #4: Defining the Workload
  - The workload to be defined depends on the underlying query/manipulation language
  - Should complex (algorithmic) operations be part of a database benchmark?
  - Which algorithms to pick?
    - Social Network Analysis → Find communities
    - Supply Chain Management → Find maximal flow
    - Web of Data → Find pattern matches
  - How are concurrent users represented?
  - What about transactionality?
- 18. > Thoughts on Graph Data Generation
- 19. > Graph Data Generation - Patterns
  - Understanding graph patterns (characteristics) is crucial for a good graph data generator
  - What are distinguishing characteristics of graphs?
  - How can we identify graph patterns on large graphs?
  - Three main patterns [1]:
    - Power law distributed
    - Small diameters
    - Community effects
- 20. > Pattern 1 – Power law distributed
  - Most real-world graph data sets follow a power law distribution
  - Examples:
    - Internet router graph
    - Subsets of the WWW
    - Citation graphs
  - (Two figures; source: [2])
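A data generator has to reproduce this skewed degree distribution. The slides do not name a particular generator, so the sketch below swaps in one well-known technique, preferential attachment in the style of the Barabási-Albert model, purely as an illustration. The function names and parameters (`preferential_attachment_edges`, `edges_per_node`, `seed`) are assumptions made for this example.

```python
import random
from collections import Counter
from typing import List, Tuple


def preferential_attachment_edges(
    num_nodes: int, edges_per_node: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Generate edges where new nodes prefer to attach to high-degree nodes."""
    if edges_per_node < 1 or num_nodes <= edges_per_node:
        raise ValueError("need num_nodes > edges_per_node >= 1")
    rng = random.Random(seed)
    edges: List[Tuple[int, int]] = []
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it selects a node with probability proportional to its degree.
    endpoints: List[int] = []
    targets = list(range(edges_per_node))
    for new_node in range(edges_per_node, num_nodes):
        edges.extend((new_node, target) for target in targets)
        endpoints.extend(targets)
        endpoints.extend([new_node] * edges_per_node)
        # Pick distinct, degree-biased targets for the next node.
        chosen = set()
        while len(chosen) < edges_per_node:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges


def degree_histogram(edges: List[Tuple[int, int]]) -> Counter:
    """Map each degree value to the number of nodes having that degree."""
    degree: Counter = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


histogram = degree_histogram(preferential_attachment_edges(2000, 2, seed=1))
for degree_value in sorted(histogram)[:5]:
    print(degree_value, histogram[degree_value])
```

On such a graph most nodes keep a small degree while a few early nodes become hubs, which is the heavy-tailed shape the slide refers to. Whether this model matches a given application domain is a separate question that the talk raises in Challenge #1.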
- 21. > Pattern 2 – Small Diameters
  - Effective diameter (eccentricity): the minimum number of hops within which a fraction (e.g. 90%) of all connected pairs of nodes can reach each other
  - Other measures exist as well, but are not applicable to disconnected graphs
  - In most use cases, the diameter is much smaller than the size of the graph
  - Examples:
    - 97% eccentricity of around 16 for path lengths in the WWW
    - Average path length around 6 for the Epinions social network
  - source: [1]
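The effective diameter definition above can be sketched directly for small graphs: run a breadth-first search from every node, collect the hop counts of all connected (ordered) pairs, and take the smallest hop count that covers the requested fraction. This is an exact, quadratic-cost illustration and an assumption about how one might compute it; the function names are hypothetical, and large graphs would need sampling or approximation instead.

```python
import math
from collections import deque
from typing import Dict, Hashable, List, Sequence


def bfs_distances(adj: Dict[Hashable, Sequence[Hashable]], source: Hashable) -> Dict[Hashable, int]:
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def effective_diameter(adj: Dict[Hashable, Sequence[Hashable]], fraction: float = 0.9) -> int:
    """Smallest hop count within which `fraction` of connected pairs are reachable."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    hops: List[int] = []
    for source in adj:
        for target, distance in bfs_distances(adj, source).items():
            if target != source:
                hops.append(distance)  # unreachable pairs never show up here
    if not hops:
        return 0
    hops.sort()
    return hops[math.ceil(fraction * len(hops)) - 1]


# Undirected path 0-1-2-3-4 plus an isolated node 5.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
print(effective_diameter(path, fraction=1.0))  # full diameter of the path: 4
print(effective_diameter(path, fraction=0.5))  # half of all connected pairs: 2
```

Because only connected pairs are counted, the isolated node does not affect the result, which is why this measure remains usable on disconnected graphs.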
- 22. > Pattern 3 – Community Effects
  - Community: a set of nodes where each node in the set is closer to all other nodes in the community than to nodes outside the community
  - Communities can be found in many real-world graphs, especially social networks and collaboration networks
  - Clustering coefficient C: a measure which quantifies the "clumpiness" of a graph
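The slide does not state which definition of the clustering coefficient is meant. The sketch below assumes one common choice, the average local clustering coefficient of an undirected graph: for a node with k neighbours, the fraction of the k(k-1)/2 possible neighbour pairs that are themselves connected. Function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def local_clustering(adj: Dict[Hashable, Set[Hashable]], node: Hashable) -> float:
    """Fraction of a node's neighbour pairs that are directly connected."""
    neighbors = adj[node] - {node}
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: no neighbour pairs means coefficient 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj.get(u, set()))
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj: Dict[Hashable, Set[Hashable]]) -> float:
    """Mean of the local clustering coefficients over all nodes."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, node) for node in adj) / len(adj)


# Triangle 0-1-2 with a pendant node 3 attached to node 0 (symmetric adjacency).
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(round(average_clustering(graph), 4))  # (1/3 + 1 + 1 + 0) / 4 = 0.5833
```

A generator that only matches the degree distribution can still produce a much lower value than real social networks show, so this number is a useful sanity check on generated data.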
- 23. > Thoughts on Query Workload
- 24. > Query Workload - Operations
  - Graph Manipulation Operations
    - Add/Update/Remove nodes from the graph
    - Add/Update/Remove edges from the graph
    - Add/Update/Remove edge attributes
    - Add/Update/Remove node attributes
  - Graph Query Operations
    - Retrieve a selection of nodes from a given filter expression
    - Get the neighbors of a set of nodes (possibly with edge filter constraints)
  - Graph Traversals
    - Based on basic query operations
    - Exploration of the neighborhood from a given set of start nodes
    - Terminated by the number of steps and/or edge/node filter constraints
  - Graph Analytical Operations
    - Aggregation operations such as sum, avg, min, max
    - Aggregations on node level and on edge level
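The traversal operation described above (start from a set of nodes, expand neighbourhoods, stop after a number of steps or when an edge filter rejects an edge) can be sketched as a bounded breadth-first search. The adjacency format and the names `traverse`, `max_steps` and `edge_filter` are assumptions for this example, not an API from the talk.

```python
from collections import deque
from typing import Any, Callable, Dict, Hashable, Iterable, List, Optional, Set, Tuple

EdgeProps = Dict[str, Any]
Adjacency = Dict[Hashable, List[Tuple[Hashable, EdgeProps]]]


def traverse(
    adj: Adjacency,
    start_nodes: Iterable[Hashable],
    max_steps: int,
    edge_filter: Optional[Callable[[EdgeProps], bool]] = None,
) -> Set[Hashable]:
    """Return all nodes reachable from start_nodes within max_steps hops,
    following only edges accepted by edge_filter (all edges if None)."""
    starts = list(start_nodes)
    visited: Set[Hashable] = set(starts)
    frontier = deque((node, 0) for node in starts)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_steps:
            continue  # step limit reached, do not expand further
        for neighbor, props in adj.get(node, ()):
            if edge_filter is not None and not edge_filter(props):
                continue
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return visited


graph: Adjacency = {
    "a": [("b", {"type": "knows"}), ("c", {"type": "blocks"})],
    "b": [("d", {"type": "knows"})],
    "d": [("e", {"type": "knows"})],
}
print(sorted(traverse(graph, ["a"], 2, lambda props: props["type"] == "knows")))
# prints ['a', 'b', 'd']
```

Describing benchmark operations at this level, as a start set, a step bound and a filter predicate, is one way to stay independent of any vendor's query language.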
- 25. > Query Workload - Measures
  - Closely related to benchmark capabilities
  - Measures from relational benchmarks apply, such as
    - Average query response time
    - Transactions per second (throughput)
  - Additional measures for graph traversals
    - Traversals per second
  - What about distributed scenarios?
  - What about concurrent users?
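As a rough illustration of the two relational-style measures, the following sketch times a sequence of operations and reports average response time and throughput. It is a single-threaded harness under simplifying assumptions (no warm-up, no concurrent users, no distribution); `run_workload` and `WorkloadResult` are hypothetical names. Passing traversal calls as the operations would yield traversals per second.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class WorkloadResult:
    operations: int
    total_seconds: float
    avg_response_seconds: float
    throughput_per_second: float


def run_workload(operations: Sequence[Callable[[], object]]) -> WorkloadResult:
    """Run each operation once, in order, and collect simple timing measures."""
    latencies = []
    wall_start = time.perf_counter()
    for operation in operations:
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    total = time.perf_counter() - wall_start
    count = len(latencies)
    average = sum(latencies) / count if count else 0.0
    throughput = count / total if total > 0 else 0.0
    return WorkloadResult(count, total, average, throughput)


# Stand-in operation; a real benchmark would issue queries against a database.
result = run_workload([lambda: sum(range(10_000))] * 100)
print(f"avg response: {result.avg_response_seconds:.6f} s, "
      f"throughput: {result.throughput_per_second:.1f} ops/s")
```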
- 26. > Summary and Outlook
  - Graph data distribution is highly important for a graph database benchmark
  - Application domains do have very specific graph characteristics
  - A graph database benchmark has to provide abstract and high-level graph operation descriptions
  - Feel free to contact me if you want to contribute: marcus.paradies@gmail.com
- 27. > Discussion
- 28. > Theses
  - A benchmark based on social network data is nice, but might not be that representative for large enterprise applications
  - Algorithms should NOT be part of a graph database benchmark
  - Only support basic operations such as simple lookups and path traversals
  - The underlying graph data model should be a simple property graph
  - A graph database has to scale in terms of data size as well as number of concurrent users
  - ....
- 29. > References
  - [1] Graph Mining: Laws, Generators, and Algorithms (2006)
  - [2] http://konect.uni-koblenz.de/
  - [3] A Discussion on the Design of Graph Database Benchmarks (2010)