Marcus Paradies

Challenges in the Design of a Graph Database Benchmark
FOSDEM'12 – Graph Processing DevRoom

© Prof. Dr.-Ing. Wolfgang Lehner
> Outline


     Motivation
     Challenges
     Thoughts on Graph Data Generation
     Thoughts on Query Workload
     Summary and Outlook
     Discussion




   Marcus Paradies |                      FOSDEM 2012   |   1
> Motivation

  Graph databases are gaining momentum

  Enterprise corporations are getting interested

  How to compare the available graph database vendors?

  Main issue: Results from benchmarks are not comparable

  Lack of standardization in the data model and query language

  What are "typical" graph operations?




> Challenges
> Challenge #1: Application Domain

  Graph data is not homogeneous

  Graph data from different domains follows different patterns

  Examples:
      Social Network Analysis (SNA)
      Protein Interaction Analysis
      Recommendation Systems
      Supply Chain Management (Vehicle Routing, CRM)
      Fraud Detection in Financial Systems
      …

 Challenge: Find an application domain which represents a graph data pattern
                common in many different scenarios.

> Challenge #2: Graph Data Model




  What flavours of graph data models are commonly used?
   Directed Graph
   Undirected Graph
   Mixed Graph
   Multi Graph
   Hyper Graph
   (Plain) Property Graph
   (Structured Property Graph)

  Challenge: Find a graph data model suited for the majority of use cases
   from various domains.
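
As a rough illustration of how a plain property graph can cover several of the flavours listed above, the sketch below stores directed, labelled edges with key/value attributes and allows parallel edges. The names `PropertyGraph`, `Edge`, `add_node` and `add_edge` are hypothetical and not taken from any particular graph database.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    source: str
    target: str
    label: str
    props: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Hypothetical in-memory property graph (directed multigraph)."""
    nodes: Dict[str, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: str, **props: Any) -> None:
        self.nodes[node_id] = dict(props)

    def add_edge(self, source: str, target: str, label: str, **props: Any) -> Edge:
        # Parallel edges are allowed, so this is a directed multigraph.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist")
        edge = Edge(source, target, label, dict(props))
        self.edges.append(edge)
        return edge


g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_edge("alice", "bob", "knows", since=2010)
g.add_edge("alice", "bob", "works_with")  # second edge between the same pair
# An undirected edge can be emulated by a pair of opposite directed edges.
g.add_edge("bob", "alice", "knows", since=2010)
```

Hyper graphs are not covered by this sketch; they would need edges with more than two endpoints.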

> Challenge #3: Querying Graph Data




   Large variety in graph processing and manipulation languages
   Each graph database vendor implements its own query languages/APIs
   Reason: No standardized graph query language available


  Challenge: Find a way to abstract from the zoo of available query languages.

> Challenge #4: Defining the Workload

  The workload to be defined depends on the underlying
   query/manipulation language

  Should complex (algorithmic) operations be part of a database benchmark?

  Which algorithms to pick?
   Social Network Analysis → Find communities
   Supply Chain Management → Find maximal flow
   Web of Data → Find pattern matches

  How are concurrent users represented?

  What about transactionality?




> Thoughts on Graph Data Generation
> Graph Data Generation - Patterns


  Understanding graph patterns (characteristics) is crucial for a good graph
   data generator
  What are distinguishing characteristics of graphs?
  How can we identify graph patterns on large graphs?
  Three main patterns [1]:
     Power law distributed
     Small diameters
     Community Effects




> Pattern 1 – Power law distributed




  (Figures; source: [2])


  In most real-world graph data sets, the node degrees follow a power law distribution
  Examples:
   Internet router graph
   Subsets of the WWW
   Citation Graphs
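
One well-known way for a data generator to obtain a heavy-tailed degree distribution is preferential attachment, where new nodes link to existing nodes with probability proportional to their degree. The slides do not prescribe a generator, so the following is only a minimal sketch of that technique; the function names and parameters are assumptions made for illustration.

```python
import random
from collections import Counter


def preferential_attachment(n, m, seed=0):
    """Return an edge list for n nodes; each new node attaches to m existing ones."""
    rng = random.Random(seed)
    edges = []
    # Start from a small clique of m + 1 nodes so every node has degree >= m.
    for u in range(m + 1):
        for v in range(u):
            edges.append((u, v))
    # 'targets' lists each node once per incident edge, so a uniform draw
    # from it picks a node with probability proportional to its degree.
    targets = [x for e in edges for x in e]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.append((new, old))
            targets.extend((new, old))
    return edges


def degree_histogram(edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


edges = preferential_attachment(n=2000, m=2)
hist = degree_histogram(edges)
for d in sorted(hist)[:5]:
    print(d, hist[d])
```

Plotting `hist` on log-log axes is a quick visual check: an approximately straight, falling line suggests a power law, although it is not a rigorous test.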


> Pattern 2 – Small Diameters

   Effective Diameter (eccentricity): Minimum number of hops in which a
    fraction (e.g. 90%) of all connected pairs of nodes can reach each other
   Other measures exist as well, but are not applicable to disconnected graphs
   In most use cases, the diameter is much smaller than the number of nodes in the graph
   Examples:
    97% eccentricity of around 16 for path lengths in the WWW
    Average path length around 6 for Epinions social network
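
The definition of the effective diameter can be sketched with one breadth-first search per node. This is a brute-force illustration for small graphs only; the adjacency-dictionary input and the function names are assumptions, and pairs are counted as ordered pairs.

```python
from collections import deque


def hop_counts(adj):
    """Count connected ordered pairs (u, v), u != v, by shortest-path length."""
    counts = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != start:
                counts[d] = counts.get(d, 0) + 1
    return counts


def effective_diameter(adj, fraction=0.9):
    """Smallest hop count within which `fraction` of connected pairs are reachable."""
    counts = hop_counts(adj)
    total = sum(counts.values())
    if total == 0:
        return 0
    reached = 0
    for hops in sorted(counts):
        reached += counts[hops]
        if reached >= fraction * total:
            return hops
    return max(counts)


# Path 0-1-2-3-4 plus an isolated node 5; unreachable pairs are ignored.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
print(effective_diameter(adj, 0.9), effective_diameter(adj, 1.0))
```

Because only connected pairs are counted, the measure stays defined on disconnected graphs, which is the property the slide highlights.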




  (Figure; source: [1])
> Pattern 3 – Community Effects


   Community: A set of nodes, where each node in the set is closer to all other
    nodes in the community than to nodes outside the community.
   Communities can be found in many real-world graphs, especially social
    networks and collaboration networks
   Clustering Coefficient C: A measure that quantifies the "clumpiness" of a graph
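
Several definitions of the clustering coefficient are in use. The sketch below assumes the common local variant, in which a node's coefficient is the fraction of pairs of its neighbours that are themselves connected, averaged over all nodes. The function names and the example graph are made up for illustration.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are directly connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# A triangle (a, b, c) with one pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(local_clustering(adj, "c"), average_clustering(adj))
```

A generated data set could be compared against a real one by checking whether both yield a similar average coefficient.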




> Thoughts on Query Workload
> Query Workload - Operations

  Graph Manipulation Operations
   Add/Update/Remove nodes of the graph
   Add/Update/Remove edges of the graph
   Add/Update/Remove edge attributes
   Add/Update/Remove node attributes
  Graph Query Operations
   Retrieve a selection of nodes matching a given filter expression
   Get the neighbors of a set of nodes (possibly with edge filter constraints)
  Graph Traversals
   Based on basic query operations
   Exploration of neighborhood from a given set of start nodes
   Terminated by the number of steps and/or edge/node filter constraints
  Graph Analytical Operations
   Aggregation operations such as sum, avg, min, max
   Aggregations on node-level and on edge-level
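
The four operation classes above can be sketched against a tiny in-memory graph. This is only an illustration of what an abstract, language-independent description of each class might look like; the data, the dictionary layout and the function names are all hypothetical.

```python
# Nodes and edges as plain dictionaries; all names and data are made up.
nodes = {
    1: {"name": "Ann", "age": 31},
    2: {"name": "Ben", "age": 45},
    3: {"name": "Cid", "age": 27},
    4: {"name": "Dee", "age": 52},
}
edges = [
    {"src": 1, "dst": 2, "label": "knows", "weight": 3},
    {"src": 2, "dst": 3, "label": "knows", "weight": 1},
    {"src": 3, "dst": 4, "label": "follows", "weight": 5},
]


def select_nodes(predicate):
    """Query: ids of all nodes whose attributes match a filter expression."""
    return {n for n, props in nodes.items() if predicate(props)}


def neighbours(start, edge_filter=lambda e: True):
    """Query: outgoing neighbours of a set of nodes, with an optional edge filter."""
    return {e["dst"] for e in edges if e["src"] in start and edge_filter(e)}


def traverse(start, max_steps, edge_filter=lambda e: True):
    """Traversal: breadth-first exploration, terminated after max_steps hops."""
    seen = set(start)
    frontier = set(start)
    for _ in range(max_steps):
        frontier = neighbours(frontier, edge_filter) - seen
        if not frontier:
            break
        seen |= frontier
    return seen


# Manipulation: add a node, update a node attribute, add an edge.
nodes[5] = {"name": "Eve", "age": 38}
nodes[1]["age"] = 32
edges.append({"src": 4, "dst": 5, "label": "knows", "weight": 2})

over_40 = select_nodes(lambda p: p["age"] > 40)
two_hops = traverse({1}, max_steps=2)
knows_only = traverse({1}, max_steps=10, edge_filter=lambda e: e["label"] == "knows")

# Analytical: aggregations on node level and on edge level.
avg_age = sum(p["age"] for p in nodes.values()) / len(nodes)
max_weight = max(e["weight"] for e in edges)
```

The traversal is built purely from the `neighbours` query, mirroring the point that traversals are based on basic query operations.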



> Query Workload - Measures


  Closely related to benchmark capabilities

  Measures from relational benchmarks apply, such as
   Average query response time
   Transactions per second (throughput)

  Additional measures for graph traversals
   Traversals per second

  What about distributed scenarios?

  What about concurrent users?
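
The measures above can be collected with a simple timing harness. The sketch below is single-threaded and in-memory, so it says nothing about the open questions of distributed scenarios or concurrent users. It assumes that "traversals per second" means edges followed per second, which is one possible reading; the function names are hypothetical.

```python
import time
from collections import deque


def bfs_edge_visits(adj, start):
    """Run a BFS from `start` and return how many edges were followed."""
    seen = {start}
    queue = deque([start])
    visits = 0
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            visits += 1
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return visits


def run_workload(adj, start_nodes):
    """Return (average response time in seconds, traversed edges per second)."""
    latencies = []
    total_visits = 0
    for start in start_nodes:
        t0 = time.perf_counter()
        total_visits += bfs_edge_visits(adj, start)
        latencies.append(time.perf_counter() - t0)
    elapsed = sum(latencies)
    avg_latency = elapsed / len(latencies)
    rate = total_visits / elapsed if elapsed > 0 else float("inf")
    return avg_latency, rate


# Ring of 1000 nodes, each linked to its two neighbours.
n = 1000
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
avg_latency, rate = run_workload(adj, start_nodes=range(0, n, 100))
print(f"avg response time: {avg_latency * 1e3:.3f} ms, traversed edges/s: {rate:.0f}")
```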




> Summary and Outlook

  The graph data distribution is highly important for a graph database benchmark

  Application domains do have very specific graph characteristics

  A graph database benchmark has to provide abstract and high-level graph
   operation descriptions



  Feel free to contact me if you want to contribute:

                            marcus.paradies@gmail.com




> Discussion
> Theses



  A benchmark based on social network data is nice, but might not be that
   representative of large enterprise applications

  Algorithms should NOT be part of a graph database benchmark

  Only support basic operations such as simple lookups and path traversals

  The underlying graph data model should be a simple property graph

  A graph database has to scale in terms of data size as well as number of
   concurrent users

  ....



> References



 [1] Graph Mining: Laws, Generators, and Algorithms (2006)

 [2] http://konect.uni-koblenz.de/

 [3] A Discussion on the Design of Graph Database Benchmarks (2010)



