LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases

SIB . 23.03.2011 . Page 1 http://lod2.eu

WP2
Storing and Querying
Very Large Knowledge Bases
Vienna Update
March 2012 – M18

Peter Boncz

http://lod2.eu


Table of Contents

• WP2 Refresher
• LOD Cloud Hosted on the Knowledge Store Cluster
* 50B mark reached, column-store Virtuoso deployed
• State of the Art LOD Laboratory (“Benchmarking”)
* LDBC – RDF Store Industry council
* BSBM at large scale
* RDF-H + Social Intelligence Benchmark (SIB)
• Technical work
* column-store Virtuoso → cluster version
* recycling query results
• Next up
* LOD cloud @250B triples
* Virtuoso: adaptive query optimizer (and more)
* first MonetDB/SPARQL version (RDF clustering, graph indexing)


WP2 Organization

CWI (MonetDB):
• Peter Boncz (also in VUA group of Frank v Harmelen)
• Duc Pham Minh (PhD student)
• Irini Fundulaki (1-year sabbatical from FORTH)

OpenLink (Virtuoso):
• Orri Erling
• Hugh Williams
• Ivan Mikhailov

+ FU Berlin (BSBM)
+ DERI (BSBM text + LOD cloud + text retrieval/Sindice)
+ ULEI (DBpedia benchmark)


WP2
Storing and Querying Very Large Knowledge Bases

Goal: enabling large-scale, feature-rich & enterprise-ready Linked
Data management solutions

Database Partners in LOD2:
CWI: Leading open source analytics RDBMS
OpenLink: Leading Linked data deployment platform

Technological Excellence:
Creating and publishing metrics for choosing RDF solutions
Bringing column-store technology to Business Intelligence on RDF
Ground-breaking database innovations for RDF stores
(Dynamic Query optimization, Adaptive Caching of Joins,
Optimized Graph Processing, Cluster/Cloud scalability)


Task 2.1: State of the Art, Evaluation & Benchmarking

LOD cloud cache scalability
• M0: 20B triples
• M12: 50B triples
• M24: 250B triples
• M36: 1T triples

D2.4 completed: 50B triples in LOD cache @ DERI
First deployment of Virtuoso7 Cluster
• Currently hosting about 55 billion triples
• 8 node Virtuoso v7 (column store) Cluster
• 384GB RAM
• 2TB Disk Storage
• 14 bytes/quad, excl. literals

Next up:
• hardware provisioning for 250B and 1T triples
(need 512GB and 2TB RAM, respectively, somewhere)


Task 2.1: State of the Art, Evaluation & Benchmarking

Benchmarking

• creating new benchmarks
* BSBM-BI (FU Berlin)
* DBpedia Benchmark (ULEI), best paper award
* RDF-H (OGL, CWI)
* Social Intelligence Benchmark (OGL, CWI)
• running benchmark evaluations
* BSBM on a large cluster (Lisa @ SARA)
* BSBM on a large single server (40 cores, 1TB RAM)
• creating industry consensus
* Benchmark Auditing Service
* LOD Benchmark Council


BSBM Large Scale Experiments (still ongoing)

New Aspects:
• The Business Intelligence Use Case (BI)
• Benchmark Rules
• BSBM V3 Results
• trying cluster versions

SARA LISA cluster
• experiments with up to 64 nodes

VectorWise high-end server
• 40-core machine with 1TB RAM

Systems benchmarked at SARA and VectorWise:

System     Version   Vendor        URL
4store     1.1.2     Garlik        http://4store.org/
BigData    r4169     SYSTAP LLC    http://www.systap.com/bigdata.htm
BigOwlim   3.4.3129  OntoText      http://www.ontotext.com/owlim/
Jena TDB   0.8.9     openjena.org  http://www.openjena.org/TDB/
Fuseki     0.1.0     openjena.org  http://openjena.org/wiki/Fuseki
Virtuoso   7.0       OpenLink      http://virtuoso.openlinksw.com/


Social Intelligence Benchmark

• 14 dictionaries of real data
• Facebook-style schema
• Realistic scenario simulation

[Figure labels: Synthetic Generated Data; Linked Open Data]


Technical Work: Recycling (D2.4)

Dynamic caching of intermediate query results
• SPARQL problem: hard to index workload / expensive backward chaining
Idea: compute once, re-use many times
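
The "compute once, re-use many times" idea can be illustrated with a minimal sketch. It assumes a toy in-memory triple list and a cache keyed by triple pattern; the class name `Recycler` and its methods are hypothetical and are not the MonetDB or Virtuoso implementation.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
# A pattern is a triple in which None stands for a variable.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


class Recycler:
    """Hypothetical cache of intermediate results, keyed by triple pattern."""

    def __init__(self, triples: List[Triple]) -> None:
        self.triples = triples
        self.cache: Dict[Pattern, List[Triple]] = {}
        self.hits = 0
        self.misses = 0

    def scan(self, pattern: Pattern) -> List[Triple]:
        """Return all triples matching the pattern, computing them at most once."""
        if pattern in self.cache:
            self.hits += 1
            return self.cache[pattern]
        self.misses += 1
        result = [
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        ]
        self.cache[pattern] = result
        return result


store = Recycler([
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "jazz"),
])
first = store.scan((None, "knows", None))   # computed by scanning the triples
second = store.scan((None, "knows", None))  # served from the cache
```

A real recycler would also have to bound the cache size and invalidate entries on updates; both are left out of this sketch.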


Technical Work: Virtuoso 7

Major upcoming release: V7, due in 2012

• column store technology:
* aggressive compression → more data fits in RAM
* vectored execution → things run faster
• elastic cluster implementation
* partitions can migrate across nodes
• bringing computation to the data
* arbitrary recursive functions in the cluster
• geospatial support
* full OpenGIS support, R-tree backed, EWKT format
• future enhancements
* adaptive query optimization (CWI ROX)
* re-use of intermediates (CWI recycling)
* using SSDs as cache
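
To show why column-wise storage compresses well, here is a minimal sketch of run-length encoding (RLE) over a sorted predicate column. RLE is a generic column-store technique chosen for illustration; the slides do not state which compression schemes Virtuoso 7 uses, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(column: List[str]) -> List[Tuple[str, int]]:
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs: List[Tuple[str, int]]) -> List[str]:
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# A predicate column sorted by value, as it might appear in a predicate-ordered index.
predicates = ["knows"] * 4 + ["likes"] * 3 + ["name"] * 5
runs = rle_encode(predicates)
```

In this toy column, 12 values shrink to 3 runs, and decoding restores the column exactly.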


Next 6 months

Virtuoso: sampled query optimizer
• query optimization in SPARQL is difficult (no stats)
• use adaptive, run-time, query optimization with sampling
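
A simplified sketch of sampling-based ordering follows. It estimates the cardinality of each triple pattern from a random sample and evaluates the most selective pattern first. This is a one-shot estimate, not the run-time re-optimization the slide refers to, and all function names and data are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
# A pattern is a triple in which None stands for a variable.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(pattern: Pattern, triple: Triple) -> bool:
    """True if every bound position of the pattern equals the triple's value."""
    return all(p is None or p == v for p, v in zip(pattern, triple))


def estimate_cardinality(pattern: Pattern, triples: List[Triple],
                         sample_size: int, rng: random.Random) -> float:
    """Scale the match count in a random sample up to the full data size."""
    sample = rng.sample(triples, min(sample_size, len(triples)))
    if not sample:
        return 0.0
    hits = sum(matches(pattern, t) for t in sample)
    return hits * len(triples) / len(sample)


def order_patterns(patterns: List[Pattern], triples: List[Triple],
                   sample_size: int = 100, seed: int = 0) -> List[Pattern]:
    """Order patterns by ascending estimated cardinality (most selective first)."""
    rng = random.Random(seed)
    return sorted(
        patterns,
        key=lambda p: estimate_cardinality(p, triples, sample_size, rng),
    )


# Skewed toy data: 990 "type" triples and only 10 "worksFor" triples.
triples = [(f"p{i}", "type", "Person") for i in range(990)]
triples += [(f"p{i}", "worksFor", "CWI") for i in range(10)]
patterns = [(None, "type", "Person"), (None, "worksFor", "CWI")]
plan = order_patterns(patterns, triples, sample_size=200)
```

With this skew, the rare `worksFor` pattern is placed first, so the join starts from the smaller intermediate result.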

MonetDB and SPARQL
• First version in sight (cooperation with FORTH)
• research tracks
* RDF clustering on Characteristic Sets
* correlated join path indexing
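
As a minimal sketch of the notion behind the first research track: the characteristic set of a subject is the set of predicates that occur with it, and subjects sharing a characteristic set are natural candidates for being clustered together. The function name and the sample data are hypothetical, and this is not the planned MonetDB implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def characteristic_sets(triples: Iterable[Triple]) -> Counter:
    """Count how many subjects share each set of predicates."""
    preds_by_subject: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, _ in triples:
        preds_by_subject[subject].add(predicate)
    return Counter(frozenset(preds) for preds in preds_by_subject.values())


data = [
    ("alice", "name", "Alice"), ("alice", "age", "30"),
    ("bob", "name", "Bob"), ("bob", "age", "25"),
    ("cwi", "name", "CWI"), ("cwi", "locatedIn", "Amsterdam"),
]
csets = characteristic_sets(data)
```

Here the two person-like subjects fall into one set, {name, age}, while the organization forms its own, {name, locatedIn}.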

LOD cache at 250B triples
• what triples to use?
• what hardware to use? (need 512GB RAM)


Contact

Address

Centrum Wiskunde & Informatica (CWI)
Science Park 123
1098 XG Amsterdam
The Netherlands

monetdb.cwi.nl

Thanks for your attention!


LOD2 Benchmark Auditing Service

Benchmarking needs of SPARQL engine vendors:
• vendors want to publish on their own timescale
• using new or upcoming releases (not yet public)
• using settings and hardware properly tuned to their solution
• yet need credibility (is it fair?)

Tournaments organized by one institution have
• bad timing, wrong version, one more bug to fix, etc
• not the right hardware or settings
• may become a legal liability once matters become more serious

LOD2 should reach out to the SPARQL technical community and
provide independent benchmark auditing services
• start with BSBM → working on Auditing Rules Document
• maybe other benchmarks later
