Challenges in the Design of a Graph Database Benchmark

Transcript of "Challenges in the Design of a Graph Database Benchmark"

  1. Challenges in the Design of a Graph Database Benchmark
       Marcus Paradies
       FOSDEM'12 – Graph Processing DevRoom
       © Prof. Dr.-Ing. Wolfgang Lehner
  2. Outline
       • Motivation
       • Challenges
       • Thoughts on Graph Data Generation
       • Thoughts on Query Workload
       • Summary and Outlook
       • Discussion
  3. Motivation
       • Graph databases are gaining momentum
       • Enterprise corporations are getting interested
       • How to compare the available graph database vendors?
       • Main issue: Results from benchmarks are not comparable
       • Lack of standardization in the data model and query language
       • What are "typical" graph operations?
  4. Challenges
  5. Challenge #1: Application Domain
       • Graph data is not homogeneous
       • Graph data from different domains follows different patterns
       • Examples:
           • Social Network Analysis (SNA)
           • Protein Interaction Analysis
           • Recommendation Systems
           • Supply Chain Management (Vehicle Routing, CRM)
           • Fraud Detection in Financial Systems
           • …
       Challenge: Find an application domain which represents a graph data pattern common in many different scenarios.
  6. Challenge #2: Graph Data Model
       What flavours of graph data models are commonly used?
  14. Challenge #2: Graph Data Model
       • Directed Graph
       • Undirected Graph
       • Mixed Graph
       • Multi Graph
       • (Plain) Property Graph
       • (Structured) Property Graph
       • Hyper Graph
       Challenge: Find a graph data model suited for the majority of use cases from various domains.
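To make the property graph flavour concrete, here is a minimal Python sketch of a directed multigraph whose nodes and edges carry key/value properties. The class name `PropertyGraph` and its methods are hypothetical and chosen for illustration only; they do not correspond to any particular graph database API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PropertyGraph:
    """Illustrative directed multigraph with properties on nodes and edges."""

    # node id -> properties of that node
    nodes: dict[int, dict[str, Any]] = field(default_factory=dict)
    # edge id -> (source, target, properties); separate edge ids allow
    # several parallel edges between the same pair of nodes
    edges: dict[int, tuple[int, int, dict[str, Any]]] = field(default_factory=dict)
    _next_edge_id: int = 0

    def add_node(self, node_id: int, **props: Any) -> None:
        self.nodes[node_id] = dict(props)

    def add_edge(self, src: int, dst: int, **props: Any) -> int:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        edge_id = self._next_edge_id
        self._next_edge_id += 1
        self.edges[edge_id] = (src, dst, dict(props))
        return edge_id

    def out_neighbors(self, node_id: int, label: Optional[str] = None) -> list[int]:
        """Targets of outgoing edges, optionally restricted to one edge label."""
        return [
            dst
            for (src, dst, props) in self.edges.values()
            if src == node_id and (label is None or props.get("label") == label)
        ]
```

An undirected graph can be emulated in this sketch by storing each edge in both directions; hypergraphs would need a different edge representation.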
  16. Challenge #3: Querying Graph Data
       • Large variety in graph processing and manipulation languages
       • Each graph database vendor implements own query languages/APIs
       • Reason: No standardized graph query language available
       Challenge: Find a way to abstract from the zoo of available query languages.
  17. Challenge #4: Defining the Workload
       • The workload to be defined depends on the underlying query/manipulation language
       • Should complex (algorithmic) operations be part of a database benchmark?
       • Which algorithms to pick?
           • Social Network Analysis → Find communities
           • Supply Chain Management → Find maximal flow
           • Web of Data → Find pattern matches
       • How are concurrent users represented?
       • What about transactionality?
  18. Thoughts on Graph Data Generation
  19. Graph Data Generation - Patterns
       • Understanding graph patterns (characteristics) is crucial for a good graph data generator
       • What are distinguishing characteristics of graphs?
       • How can we identify graph patterns on large graphs?
       • Three main patterns [1]:
           • Power law distributed
           • Small diameters
           • Community Effects
  20. Pattern 1 – Power law distributed
       [Two figures, source: [2]]
       • Most real-world graph data sets follow a power law distribution
       • Examples:
           • Internet router graph
           • Subsets of the WWW
           • Citation Graphs
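A data generator has to reproduce this skewed degree distribution. One common way to obtain it is preferential attachment, where each new node links to existing nodes with probability proportional to their current degree. The slides do not prescribe a generator, so the sketch below is an assumption for illustration; the function name and parameters are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment_graph(n, m, seed=0):
    """Return an undirected edge list with n nodes.

    Every node added after the first m attaches to m distinct existing
    nodes, picked with probability proportional to their degree.
    """
    if m < 1 or n <= m:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new node links to the m seed nodes
    endpoints = []            # each node appears once per incident edge
    for new_node in range(m, n):
        for target in targets:
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
        # Sampling uniformly from `endpoints` is sampling proportional to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


def degree_counts(edges):
    """Map each node to its degree."""
    degrees = Counter()
    for u, v in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees
```

Plotting the histogram of `degree_counts(edges).values()` on log-log axes is a quick visual check: a heavy tail with a few very high-degree nodes is expected, whereas a uniformly random graph would show degrees concentrated around the mean.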
  21. Pattern 2 – Small Diameters
       • Effective Diameter (eccentricity): Minimum number of hops, in which a fraction (e.g. 90%) of all connected pairs of nodes can reach each other
       • Other measures exist as well, but are not applicable to disconnected graphs
       • In most use cases, diameter is much smaller than the size of the graph
       • Examples:
           • 97% eccentricity of around 16 for path lengths in the WWW
           • Average path length around 6 for Epinions social network
       Source: [1]
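The definition of the effective diameter translates directly into code: run a breadth-first search from every node, count the connected (ordered) pairs at each distance, and return the smallest hop count that covers the requested fraction of them. The following is a minimal sketch for small graphs, assuming an adjacency mapping `adj` in which every node appears as a key; the exact all-pairs approach is quadratic and would need sampling or approximation on large graphs.

```python
from collections import Counter, deque


def effective_diameter(adj, fraction=0.9):
    """Smallest hop count d such that at least `fraction` of all connected
    ordered node pairs are within d hops of each other."""
    pairs_at_distance = Counter()
    for source in adj:
        # Plain BFS; unreachable nodes never enter `dist`, so only
        # connected pairs are counted.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != source:
                pairs_at_distance[d] += 1

    total = sum(pairs_at_distance.values())
    if total == 0:
        return 0  # no connected pairs at all
    covered = 0
    for d in sorted(pairs_at_distance):
        covered += pairs_at_distance[d]
        if covered >= fraction * total:
            return d
    return max(pairs_at_distance)
```

Because unreachable pairs are simply ignored, the measure stays well defined on disconnected graphs, which is the property the slide highlights.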
  22. Pattern 3 – Community Effects
       • Community: A set of nodes, where each node in the set is closer to all other nodes in the community than to nodes outside the community.
       • Communities can be found in many real-world graphs, especially social networks and collaboration networks
       • Clustering Coefficient C: A measure which quantifies the "clumpiness" of a graph
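The slide does not say which variant of the clustering coefficient is meant. One widely used definition is the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected), averaged over all nodes. The sketch below assumes that definition and an undirected graph given as a mapping from each node to the set of its neighbors.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are linked to each other."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs, coefficient defined as 0 here
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, n) for n in adj) / len(adj)
```

A triangle yields 1.0 and a star yields 0.0, so a generator that is supposed to reproduce community effects should produce values clearly above those of a random graph with the same number of nodes and edges.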
  23. Thoughts on Query Workload
  24. Query Workload - Operations
       • Graph Manipulation Operations
           • Add/Update/Remove Nodes from the Graph
           • Add/Update/Remove Edges from the Graph
           • Add/Update/Remove Edge attributes
           • Add/Update/Remove Node attributes
       • Graph Query Operations
           • Retrieve selection of nodes from given filter expression
           • Getting the neighbors of a set of nodes (possibly with edge filter constraints)
       • Graph Traversals
           • Based on basic query operations
           • Exploration of neighborhood from a given set of start nodes
           • Terminated by the number of steps and/or edge/node filter constraints
       • Graph Analytical Operations
           • Aggregation operations such as sum, avg, min, max
           • Aggregations on node-level and on edge-level
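The traversal operation described above can be sketched as a breadth-first exploration from a set of start nodes that stops after a fixed number of steps and skips edges or nodes rejected by optional filters. The function name, the `out_edges` layout (node mapped to a list of `(target, edge_properties)` pairs), and the filter callbacks are assumptions made for this example, not part of any vendor API.

```python
from collections import deque


def traverse(out_edges, start_nodes, max_steps, edge_filter=None, node_filter=None):
    """Explore the neighborhood of `start_nodes`.

    Returns a dict mapping each reached node to the number of steps
    needed to reach it. Expansion stops at `max_steps`; edges and nodes
    for which a filter returns False are not followed.
    """
    steps = {node: 0 for node in start_nodes}
    queue = deque(steps)
    while queue:
        u = queue.popleft()
        if steps[u] == max_steps:
            continue  # step limit reached, do not expand further
        for v, props in out_edges.get(u, []):
            if v in steps:
                continue
            if edge_filter is not None and not edge_filter(props):
                continue
            if node_filter is not None and not node_filter(v):
                continue
            steps[v] = steps[u] + 1
            queue.append(v)
    return steps
```

With `max_steps=1` this degenerates to the neighbor lookup listed under the query operations, which illustrates why traversals are described as being based on the basic query operations.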
  25. Query Workload - Measures
       • Closely related to benchmark capabilities
       • Measures from relational benchmarks apply such as
           • Average query response time
           • Transactions per second (throughput)
       • Additional measures for graph traversals
           • Traversals per second
       • What about distributed scenarios?
       • What about concurrent users?
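As a rough illustration of how average response time and throughput relate, the following sketch times a sequence of operations issued by a single client. The helper `measure` and the keys of its result are hypothetical; it deliberately ignores the open questions on the slide (concurrent users, distributed setups), which a real benchmark harness would have to address.

```python
import time


def measure(operation, inputs):
    """Run `operation` once per input and report simple timing measures."""
    latencies = []
    start = time.perf_counter()
    for item in inputs:
        t0 = time.perf_counter()
        operation(item)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    if not latencies:
        raise ValueError("inputs must not be empty")
    return {
        "operations": len(latencies),
        "avg_response_time_s": sum(latencies) / len(latencies),
        # Operations completed per second of wall-clock time.
        "throughput_per_s": len(latencies) / elapsed if elapsed > 0 else float("inf"),
    }
```

If `operation` performs one traversal per call, `throughput_per_s` corresponds to traversals per second for a single sequential client.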
  26. Summary and Outlook
       • Graph data distribution highly important for graph database benchmark
       • Application domains do have very specific graph characteristics
       • A graph database benchmark has to provide abstract and high-level graph operation descriptions
       • Feel free to contact me if you want to contribute: marcus.paradies@gmail.com
  27. Discussion
  28. Theses
       • A benchmark based on social network data is nice, but might not be that representative for large enterprise applications
       • Algorithms should NOT be part of a graph database benchmark
           • Only support basic operations such as simple lookups and path traversals
       • The underlying graph data model should be a simple property graph
       • A graph database has to scale in terms of data size as well as number of concurrent users
       • ....
  29. References
       [1] Graph Mining: Laws, Generators, and Algorithms (2006)
       [2] http://konect.uni-koblenz.de/
       [3] A Discussion on the Design of Graph Database Benchmarks (2010)