LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases

WP2 Storing and Querying Very Large Knowledge Bases
Peter Boncz

LOD Cloud Hosted on the Knowledge Store Cluster
* major performance increases in OpenLink
* BSBM-v3 / BSBM-BI
* LOD2 Benchmark Auditing Service (BAS)
* new benchmarks: RDF-H + Social Intelligence Benchmark (SIB)
* dynamic cluster repartitioning
* integration MonetDB-OpenLink
* caching intermediates
* graph path processing
* entity ranking
* geo

WP2 Storing and Querying Very Large Knowledge Bases
Goal: enabling large-scale, feature-rich & enterprise-ready Linked Data management solutions
Database Partners in LOD2:
* CWI: leading open source analytics RDBMS
* OpenLink: leading Linked Data deployment platform
Technological Excellence:
* Creating and publishing metrics for choosing RDF solutions
* Bringing Column Store Technology for Business Intelligence on RDF
* Ground-breaking database innovations for RDF stores (Dynamic Query Optimization, Adaptive Caching of Joins, Optimized Graph Processing, Cluster/Cloud scalability)

WP2 Linked Open Data for real in your Apps
Business Advantages:
* Enrich your application with (free & rich) Linked Open Data
* RDF store technology has 10x lower deployment costs than relational for ragged data
Technological Flexibility:
* Deliver Schema-Last Flexibility and Inference at Relational Data Warehouse Cost and Performance
* Grow as you go: the LOD2 platform dynamically adapts to your usage patterns and the structure of your data
* Integrate, resolve, align anything: Schema, instance identity
Rich Features for complex Applications:
* Advanced SPARQL and SQL query processing
* SPARQL and SQL Federation
* Full Text, Geospatial, Text Search
* Scale-Out on Clusters, Replication

LOD Cloud Deliverables (OGL)
D2.1.1 Initial (M3)
original targets:
D2.1.3 Intermediate (M15): 50B triples
D2.1.4 Intermediate2 (M27): 250B triples
D2.1.5 Final (M39): 1T triples
Activities:
* making it faster (adopting MonetDB principles)
  * column store and compression
  * vectored execution
* introducing multi-core features
* getting more data sets
* crawling the web, NLP extraction of data
* benchmarking with synthetic data sets

Task 2.1: State of the Art, Evaluation & Benchmarking
This task reviews the state of the art in RDF and relational analytics databases and creates a laboratory with the leading products of both categories installed. This laboratory can serve as a testing and benchmarking resource for constantly measuring the project's progress against the baseline of the best in the market. Benchmarking in LOD2 serves two purposes: measuring the relative cost of RDF versus equivalent relational functionality, and measuring RDF performance in applications that are RDF's home terrain, e.g. integration of highly heterogeneous, "ragged" content with alignment at preprocessing/run time by rules and machine learning approaches. For the first case, we can use TPC-H and its star schema derivative (SSBM). For the second case, new benchmarks need to be developed, encompassing different functionality.

Task 2.1: State of the Art, Evaluation & Benchmarking
The benchmarks will be developed primarily during the first year, with work on integration quality metrics extending over the second year. The benchmarks will be run and results published at each milestone of the project. Huge data size scalability (e.g. a trillion triples) is expected to require a cluster, most feasibly a temporary deployment in a cloud system. The goal of the DB work in LOD2 is to reduce the cost of deployment as much as possible by devising techniques that reduce the memory requirements of large RDF deployments. We currently envision Oracle 11g R2, BigOWLIM, YARS, Vertica, AllegroGraph, VectorWise and MonetDB to be deployed in the LOD2 benchmarking laboratory. As benchmarks we envision TPC-H, LUBM, UOBM, BSBM, SP2Bench and SSBM; and, as described above, we propose the creation of a new benchmark patterned after social networking data.

Task 2.1: State of the Art, Evaluation & Benchmarking
D1.2 State of the Art Analysis (M3)
* Held a survey among RDF engine vendors (Jena TDB/SDB, 4Store, BigOWLIM, OpenLink Virtuoso)
* Established contacts for future benchmarking activities
D2.1.2 State of the Art LOD Laboratory (M6)
* Installed engines at two sites (FUB, CWI)
* Ran initial experiments on BSBMv3
To follow:

Social Intelligence Benchmark

BSBM V3 Results
Benchmarked at FUB:
* 4store 1.1.2, Garlik, http://4store.org/
* BigData r4169, SYSTAP LLC, http://www.systap.com/bigdata.htm
* BigOwlim 3.4.3129, OntoText, http://www.ontotext.com/owlim/
* Jena TDB 0.8.9, openjena.org, http://www.openjena.org/TDB/
* Fuseki 0.1.0, openjena.org, http://openjena.org/wiki/Fuseki
* Virtuoso 7.0, OpenLink, http://virtuoso.openlinksw.com/
Main new conclusions: we ran into several technical problems for BI. To give the store vendors time to fix and optimize their stores, we are considering running the tests again in about three or four months. For the next test runs we will also modify query 4, because its quadratic complexity makes it scale badly.
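The scalability concern about query 4 can be illustrated with a small sketch. It only compares how work grows under a linear and a quadratic cost model as the data set is scaled up. The function name `relative_cost` and the chosen scale factors are hypothetical, and nothing here is measured BSBM data.

```python
def relative_cost(scale_factor, exponent):
    """Return how much the work grows when the data grows by scale_factor.

    exponent=1 models a linear query, exponent=2 a quadratic one.
    This is an idealized cost model, not a measurement.
    """
    return scale_factor ** exponent


if __name__ == "__main__":
    # Hypothetical scale-ups of the benchmark data set.
    for scale in (1, 10, 100):
        linear = relative_cost(scale, 1)
        quadratic = relative_cost(scale, 2)
        print(f"{scale}x data: linear {linear}x work, quadratic {quadratic}x work")
```

Under this model a tenfold larger data set costs a quadratic query a hundred times more work, which is why such a query comes to dominate the total run time at large scale factors.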

LOD2 Benchmark Auditing Service
Benchmarking needs of SPARQL engine vendors:
* using new or upcoming releases (not yet public)
* using properly tuned settings and hardware for their solution
* yet need credibility (is it fair?)
Tournaments organized by one institution:
* not the right hardware or settings
* may become a legal liability once matters become more serious
LOD2 should reach out to the SPARQL technical community and provide independent benchmark auditing services
* maybe other benchmarks later
* queries not always properly balanced / weights thought out
