
Fedbench - A Benchmark Suite for Federated Semantic Data Processing


  1. FedBench - A Benchmark Suite for Federated Semantic Data Processing
     Michael Schmidt¹, Olaf Görlitz², Peter Haase¹, Günter Ladwig³, Andreas Schwarte¹, Thanh Tran³
     10th Intl. Semantic Web Conference, Oct 26, 2011, Bonn
  2. Linked Data Evaluation Strategies
     Diagram: Centralized Linked Data Processing. A query is evaluated against a central repository that holds the RDF data of several sources.
  3. Linked Data Evaluation Strategies
     Diagram, left: Centralized Linked Data Processing. A query is evaluated against a central repository that holds the RDF data of several sources.
     Diagram, right: Federated Linked Data Processing. A query is sent to a federation layer, which accesses the RDF data through a local repository, SPARQL endpoints, and dynamic HTTP lookups.
  4. Centralized vs. Federated Approaches
     Centralized Processing
       • Data periodically crawled, gathered, and updated
       • High reliability and controllability
       • Inflexible set of data sources
       • Comprehensive knowledge about data, also useful for query optimization
     Federated Processing
       • Use of original data sources ensures that data is always "up-to-date"
       • No control over federation members
       • Ad-hoc integration of remote sources
       • Requires careful optimization, but offers opportunities (parallelization)
  5. Centralized vs. Federated Approaches
     Centralized Processing
       • Data periodically crawled, gathered, and updated
       • High reliability and controllability
       • Inflexible set of data sources
       • Comprehensive knowledge about data, also useful for query optimization
     Federated Processing
       • Use of original data sources ensures that data is always "up-to-date"
       • No control over federation members
       • Ad-hoc integration of remote sources
       • Requires careful optimization, but offers opportunities (parallelization)
     Key Observations
       (1) Both centralized and federated Linked Data processing have practical use cases
       (2) Radically different requirements, challenges, and characteristics
  6. Benchmarking Linked Data Evaluation
     Diagram, left: Centralized Linked Data Processing (query against a central repository). Existing benchmarks: BSBM, LUBM, SP2Bench, ...
     Diagram, right: Federated Linked Data Processing (query against a federation layer over a local repository, SPARQL endpoints, and dynamic HTTP lookups). So far no benchmarks proposed.
  7. Challenges in Federated Linked Data Benchmarking: Heterogeneity of Use Cases
     Data level
       • (D1) Physical Distribution
           - Local vs. remote
       • (D2) Data Access Interfaces
           - Native repository
           - SPARQL Endpoint
           - Linked Data (HTTP)
       • (D3) Knowledge about Data Source Existence
       • (D4) Data Statistics
     Query level
       • (Q1) Query Language
           - Expressiveness
           - Complexity
       • (Q2) Result Completeness
       • (Q3) Ranking
       • Various other characteristics
           - Join types
           - Result size
           - ...
  8. Challenges in Federated Linked Data Benchmarking: Heterogeneity of Use Cases
     Data level
       • (D1) Physical Distribution
           - Local vs. remote
       • (D2) Data Access Interfaces
           - Native repository
           - SPARQL Endpoint
           - Linked Data (HTTP)
       • (D3) Knowledge about Data Source Existence
       • (D4) Data Statistics
     Query level
       • (Q1) Query Language
           - Expressiveness
           - Complexity
       • (Q2) Result Completeness
       • (Q3) Ranking
       • Various other characteristics
           - Join types
           - Result size
           - ...
     Need for a flexible benchmark suite rather than a "one-size-fits-all" benchmark scenario!
  9. FedBench Components (ctd)
     Data Sets
       • Vary in structuredness, domain, size, etc.
       • Grouped in collections
  10. Data Collections
      • Cross-Domain Collection
      • Life Science Collection
      • SP2Bench Data Collection
          - Synthetic data
          - Split into sub-datasets according to types
  11. FedBench Components (ctd)
      Data Sets
        • Vary in structuredness, domain, size, etc.
        • Grouped in collections
      Queries
        • Operate on the data collections
        • Logically grouped
  12. Example Query
      List all US presidents including their party and associated news.
        SELECT ?pres ?party ?page
        WHERE {
          ?pres rdf:type dbpedia-owl:President .
          ?pres dbpedia-owl:nationality dbpedia:United_States .
          ?pres dbpedia-owl:party ?party .
          ?x nytimes:topicPage ?page .
          ?x owl:sameAs ?pres
        }
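The example query mixes DBpedia and New York Times vocabulary, so a federated engine first has to decide which source can answer which triple pattern. The following is a minimal sketch of that source selection step, not FedBench code: the `CATALOG` contents, the source names, and the predicate-only matching rule are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

TriplePattern = Tuple[str, str, str]

# The five triple patterns of the example query.
PATTERNS: List[TriplePattern] = [
    ("?pres", "rdf:type", "dbpedia-owl:President"),
    ("?pres", "dbpedia-owl:nationality", "dbpedia:United_States"),
    ("?pres", "dbpedia-owl:party", "?party"),
    ("?x", "nytimes:topicPage", "?page"),
    ("?x", "owl:sameAs", "?pres"),
]

# Toy catalog (assumption): the predicates each source is known to contain.
CATALOG: Dict[str, Set[str]] = {
    "DBpedia": {"rdf:type", "dbpedia-owl:nationality",
                "dbpedia-owl:party", "owl:sameAs"},
    "NYTimes": {"rdf:type", "nytimes:topicPage", "owl:sameAs"},
}


def select_sources(pattern: TriplePattern,
                   catalog: Dict[str, Set[str]]) -> List[str]:
    """Return the sources whose catalog entry lists the pattern's predicate."""
    _, predicate, _ = pattern
    return sorted(name for name, preds in catalog.items() if predicate in preds)


def plan(patterns: List[TriplePattern],
         catalog: Dict[str, Set[str]]) -> Dict[TriplePattern, List[str]]:
    """Map every triple pattern to its candidate sources."""
    return {p: select_sources(p, catalog) for p in patterns}


if __name__ == "__main__":
    for pattern, sources in plan(PATTERNS, CATALOG).items():
        print(" ".join(pattern), "->", ", ".join(sources))
```

In this toy catalog, generic predicates such as `rdf:type` and `owl:sameAs` match both sources, while the vocabulary-specific predicates match exactly one. Patterns that match several sources have to be sent to each of them, which is one reason the number of requests grows without a smarter selection strategy.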
  13. Queries
      • Partially taken from prototype systems, partially designed to capture challenges in federated query processing
      • Four sets of queries
          - Life Science
              · Life Science query set (full SPARQL): 7 queries (LS)
          - Cross Domain
              · Cross Domain query set (full SPARQL): 7 queries (CD)
              · Linked Data query set (BGPs): 11 queries (LD)
          - SP2Bench
              · SP2Bench query set (full SPARQL): 14 queries (SP)
      • Focus on different functional aspects
          - General federated query processing requirements
          - Pure Linked Data processing
  14. Queries
      Operators: A = AND, U = UNION, O = OPTIONAL, F = FILTER
      Solution Modifiers: Or = ORDER BY, D = DISTINCT, L = LIMIT, Of = OFFSET
  15. Queries
  16. FedBench Components (ctd)
      Data Sets
        • Vary in structuredness, domain, size, etc.
        • Grouped in collections
      Queries
        • Operate on the data collections
        • Logically grouped
      Benchmark Driver
        • Executes FedBench in a unified way
        • Java, Open Source → easily adjustable and extensible
  17. Evaluation Framework
      • Parametrizable benchmark driver
      • Implemented in Java using the Sesame framework
      • Highly customizable via config files
          - Data and query sets
          - Number of runs, timeouts
          - Deployment method of data sets
          - Metrics (loading time, evaluation time, #requests)
      • Highly extendable, which makes it easy to connect new systems on demand
  18. FedBench Components (ctd)
      Data Sets
        • Vary in structuredness, domain, size, etc.
        • Grouped in collections
      Queries
        • Operate on the data collections
        • Logically grouped
      Benchmark Driver
        • Executes FedBench in a unified way
        • Java, Open Source → easily adjustable and extensible
      Benchmark Results
        • CSV, RDF
  19. FedBench Components (ctd)
      Data Sets
        • Vary in structuredness, domain, size, etc.
        • Grouped in collections
      Queries
        • Operate on the data collections
        • Logically grouped
      Benchmark Driver
        • Executes FedBench in a unified way
        • Java, Open Source → easily adjustable and extensible
      Benchmark Results
        • CSV, RDF
      Publishing
        • Wiki-based platform for Linked Data
        • Publishing and discussion of benchmark results
  20. Evaluation
      • Goal: demonstrate the practicability and flexibility of the benchmark
          - Cover a variety of scenarios
          - Assess first state-of-the-art results
          - Identify weaknesses and strengths of systems
      • Measures
          - Query evaluation time
          - Number of requests sent to remote sources
      • Hardware
          - HP ProLiant DL360 server (ILO2)
          - 4-core CPU at 2000 MHz
          - 64-bit Windows Server 2008, running 64-bit JVM 1.6.0_22
          - 32 GB RAM (20 GB for the federation mediator, rest distributed among federation members)
  21. Evaluation: Scenario A
      • "Centralized vs. Federated" query processing
          - Scenario A1: Centralized processing
              · Sesame 2.3.1
          - Scenario A2: Local federation
              · Sesame 2.3.1 + AliBaba
          - Scenario A3: SPARQL Endpoint federation (HTTP)
              · Sesame 2.3.1 + AliBaba
              · SPLENDID from WeST
      • 10 min timeout per query
      • Average over three runs (after warm-up phase)
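The measurement protocol on this slide (a per-query timeout, a warm-up phase, and an average over three runs) can be sketched as follows. This is a hedged illustration in Python rather than the Java benchmark driver itself; `timed_run`, `benchmark`, and the `query_fn` callable are hypothetical names, and a single warm-up run is an assumption since the slide does not state how long the warm-up phase is.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional


def timed_run(query_fn: Callable[[], object],
              timeout_s: float) -> Optional[float]:
    """Run query_fn once; return elapsed seconds, or None on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    future = pool.submit(query_fn)
    try:
        future.result(timeout=timeout_s)
        return time.perf_counter() - start
    except FutureTimeout:
        # The worker thread cannot be killed from here; a real driver
        # would also have to cancel the query on the system under test.
        return None
    finally:
        pool.shutdown(wait=False)


def benchmark(query_fn: Callable[[], object], runs: int = 3,
              warmup: int = 1, timeout_s: float = 600.0) -> Optional[float]:
    """Average evaluation time over `runs` measured runs after warm-up.

    Returns None if any measured run exceeds the timeout.
    """
    for _ in range(warmup):
        timed_run(query_fn, timeout_s)  # warm-up result is discarded
    times = []
    for _ in range(runs):
        elapsed = timed_run(query_fn, timeout_s)
        if elapsed is None:
            return None
        times.append(elapsed)
    return statistics.mean(times)
```

The default `timeout_s=600.0` corresponds to the 10 min timeout named on the slide. Reporting a timed-out query as `None` instead of averaging it in is a design choice of this sketch.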
  22. Scenario A: Life Science Queries
      Data size: 50M triples in total
      #Requests to Endpoints
        System                            LS1   LS2   LS3     LS4   LS5    LS6     LS7
        Endpoint Federation (AliBaba)     13    61    (410)   21k   17k    (130)   (876)
        Endpoint Federation (SPLENDID)    2     49    9       10    4778   322     4889
  23. Evaluation: Scenario B
      • Scenario B: Linked Data query set on CD collection
          - Bottom-up approach
          - Top-down approach
          - Mixed approach
      • Local CumulusRDF Linked Data server
      • Systems: dedicated prototype implementations*
      • Major findings
          - Top-down approach most performant
          - Mixed approach competitive, bringing the merits of earlier result reporting
      * G. Ladwig, T. Tran: Linked Data Query Processing Strategies. In Proc. ISWC, 2010.
  24. Summary: Central Findings
      • Effective join ordering is often impossible when no intelligent source selection strategy is given
      • In such cases: often a very high number of requests (10^4+), caused by the iterative, nested-loop evaluation strategy of AliBaba
      • Limited capabilities of Sesame to deal with parallelization cause problems (locking issues)
      In the following talk: FedX, a federated query processing system that tackles these issues!
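Why an iterative nested-loop strategy produces so many requests can be seen in a small simulation. The sketch below is not AliBaba code: `CountingEndpoint` is a hypothetical in-memory stand-in for a remote SPARQL endpoint, and the toy data is made up for illustration. It sends one request for the left-hand pattern and then one request per intermediate binding for the right-hand pattern.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


class CountingEndpoint:
    """Toy stand-in for a remote endpoint that counts incoming requests."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.requests = 0

    def match(self, bound: Optional[Row] = None) -> List[Row]:
        """Answer one request: return the rows compatible with the bindings."""
        self.requests += 1
        bound = bound or {}
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in bound.items())]


def nested_loop_join(left: CountingEndpoint, right: CountingEndpoint,
                     join_var: str) -> List[Row]:
    """Join by probing `right` once for every row returned by `left`."""
    results: List[Row] = []
    for l_row in left.match():                                 # 1 request
        for r_row in right.match({join_var: l_row[join_var]}):  # 1 per left row
            results.append({**l_row, **r_row})
    return results


if __name__ == "__main__":
    left = CountingEndpoint([{"pres": i} for i in range(10_000)])
    right = CountingEndpoint([{"pres": i, "page": f"page-{i}"}
                              for i in range(0, 10_000, 1_000)])
    joined = nested_loop_join(left, right, "pres")
    print(len(joined), "results,", left.requests + right.requests, "requests")
```

With 10,000 intermediate bindings the join issues 10,001 requests for only 10 results, which is the order of magnitude the slide reports. A poor join order that puts the unselective pattern on the left makes this worse, which is why source selection and join ordering interact.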
  25. Conclusion
      • Benchmark flexible enough to cover a wide range of semantic data use cases/applications
      • Evaluation reveals severe deficiencies of today's approaches
      • Upcoming tasks/future work
          - General SPARQL 1.1 extensions
          - SPARQL 1.1 federation extensions
          - Distributed reasoning
      • Laid out as a community project: you are invited to contribute your own data and queries!
  26. Questions?
      http://code.google.com/p/fbench/
