LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases


State of Play presentation at the LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases by Peter Boncz of CWI.

  • For the aforementioned reasons, we propose an RDF and graph database benchmark, called the Social Intelligence Benchmark (SIB), that exploits the advantages of RDF for graph representation. We aim to test graph database performance on a highly connected graph. Because social networks are a high-profile use case for graph data management, we design our benchmark around social network scenarios. We try to generate data that is as realistic as possible, with correlations, and offer challenging queries over those correlations. Besides, since a very large amount of useful information is available in many linked open datasets, we exploit these resources by linking to them.
  • Now I will describe the data specification of SIB. As Facebook is the most popular social network, with more than 800 million active users, we take the Facebook schema style as the baseline for designing SIB. To generate realistic data, we use 14 dictionaries built from real data. These dictionaries cover various domains, for example geographical information and personal names. SIB data is designed to simulate realistic scenarios, including the real behavior of users and the characteristics of data distributions in social networks. As mentioned before, our synthetic data is linked with well-known linked open data: SIB is linked with DBpedia, one of the largest linked open datasets.
  • I think most of us know Facebook and even have a Facebook account. The logical schema of our benchmark simulates the Facebook schema, in which a user can have many friends, with friendships between them. A user can provide profile information such as his name, where he studies, and where he lives. He can also specify his current status, for example being in a relationship with another user. The user can upload many photos, start a discussion by writing posts, and get many comments from his friends.
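The generation idea described in these notes (dictionaries of real values, correlated attributes, links into DBpedia) can be illustrated with a minimal sketch. This is not the SIB generator: the three tiny dictionaries, the `http://example.org/sib/` namespace, the `generate` function, and the `same_country_bias` parameter are all assumptions made for illustration. Only the FOAF property names and the DBpedia resource prefix are real vocabulary.

```python
import random

# Hypothetical miniature "dictionaries"; SIB itself uses 14 built from real data.
NAMES_BY_COUNTRY = {
    "Germany": ["Anna", "Lukas", "Jonas"],
    "Netherlands": ["Sanne", "Daan", "Pieter"],
    "Vietnam": ["Minh", "Lan", "Duc"],
}

FOAF = "http://xmlns.com/foaf/0.1/"
DBPEDIA = "http://dbpedia.org/resource/"
SIB = "http://example.org/sib/"  # placeholder namespace, not the real one


def generate(num_users, friends_per_user, same_country_bias=0.8, seed=42):
    """Return (subject, predicate, object) triples for a toy social graph."""
    rng = random.Random(seed)
    countries = sorted(NAMES_BY_COUNTRY)
    users, triples = [], []
    for i in range(num_users):
        country = rng.choice(countries)
        # Correlation 1: the first name is drawn from the user's country.
        name = rng.choice(NAMES_BY_COUNTRY[country])
        uri = f"{SIB}user{i}"
        users.append((uri, country))
        triples.append((uri, FOAF + "name", name))
        # Link into linked open data: the location is a DBpedia resource.
        triples.append((uri, FOAF + "based_near", DBPEDIA + country))
    for uri, country in users:
        same = [u for u, c in users if c == country and u != uri]
        other = [u for u, c in users if c != country]
        friends = set()
        for _ in range(friends_per_user):
            # Correlation 2: friends are more likely to share the country.
            prefer_same = same and rng.random() < same_country_bias
            pool = same if prefer_same else (other or same)
            if pool:
                friends.add(rng.choice(pool))
        triples.extend((uri, FOAF + "knows", f) for f in sorted(friends))
    return triples
```

Calling `generate(20, 3)` yields name, location, and friendship triples in which a query such as "friends living in the same country" has a skewed, correlated result size, which is the kind of property the notes say the benchmark queries are meant to exercise.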

Transcript

  • 1. SIB . 23.03.2011 . Page 1 http://lod2.eu WP2 Storing and Querying Very Large Knowledge Bases Vienna Update March 2012 – M18 Peter Boncz http://lod2.eu
  • 2. Table of Contents
    • WP2 Refresher
    • LOD Cloud Hosted on the Knowledge Store Cluster
      * 50B mark reached, column-store Virtuoso deployed
    • State of the Art LOD Laboratory (“Benchmarking”)
      * LDBC – RDF Store Industry council
      * BSBM at large scale
      * RDF-H + Social Intelligence Benchmark (SIB)
    • Technical work
      * column-store Virtuoso → cluster version
      * recycling query results
    • Next up
      * LOD cloud @250B triples
      * Virtuoso: adaptive query optimizer (and more)
      * first MonetDB/SPARQL version (RDF clustering, graph indexing)
  • 3. LOD2 Title . 02.09.2010 . Page 3 http://lod2.eu WP2 Organization
    CWI (MonetDB):
    • Peter Boncz (also in the VUA group of Frank van Harmelen)
    • Duc Pham Minh (PhD student)
    • Irini Fundulaki (1-year sabbatical from FORTH)
    OpenLink (Virtuoso):
    • Orri Erling
    • Hugh Williams
    • Ivan Mikhailov
    + FU Berlin (BSBM)
    + DERI (BSBM text + LOD cloud + text retrieval/sindice)
    + ULEI (DBpedia benchmark)
  • 4. WP2 Storing and Querying Very Large Knowledge Bases
    Goal: enabling large-scale, feature-rich & enterprise-ready Linked Data management solutions
    Database Partners in LOD2:
    • CWI: Leading open source analytics RDBMS
    • OpenLink: Leading Linked Data deployment platform
    Technological Excellence:
    • Creating and publishing metrics for choosing RDF solutions
    • Bringing Column Store Technology for Business Intelligence on RDF
    • Ground-breaking database innovations for RDF stores (Dynamic Query optimization, Adaptive Caching of Joins, Optimized Graph Processing, Cluster/Cloud scalability)
  • 5. Task 2.1: State of the Art, Evaluation & Benchmarking
    LOD cloud cache scalability
    • M0: 20B triples
    • M12: 50B triples
    • M24: 250B triples
    • M36: 1T triples
    D2.4 completed: 50B triples in LOD cache @ DERI
    First deployment of Virtuoso7 Cluster
    • Currently hosting about 55 billion triples
    • 8 node Virtuoso v7 (column store) Cluster
    • 384GB RAM
    • 2TB Disk Storage
    • 14B/quads, excl. literals
    Next up:
    • hardware provisioning for 250B and 1T triples (need 512GB RAM and 2TB RAM respectively, somewhere)
  • 6. Task 2.1: State of the Art, Evaluation & Benchmarking
    Benchmarking
    • creating new benchmarks
      * BSBM-BI (FU Berlin)
      * DBpedia Benchmark (ULEI) – best paper award
      * RDF-H (OGL, CWI)
      * Social Intelligence Benchmark (OGL, CWI)
    • running benchmark evaluations
      * BSBM on a large cluster (Lisa @ SARA)
      * BSBM on a large single server (40 cores, 1TB RAM)
    • creating industry consensus
      * Benchmark Auditing Service
      * LOD Benchmark Council
  • 7. BSBM Large Scale Experiments (still ongoing..)
    New Aspects:
    • The Business Intelligence Use Case (BI)
    • Benchmark Rules
    • BSBM V3 Results
    • trying cluster versions
    SARA LISA cluster
    • experiments with up to 64 nodes
    VectorWise high-end server
    • 40-core machine with 1TB RAM
    Benchmarked at SARA and Vectorwise:
      System     Version    Vendor        URL
      4store     1.1.2      Garlik        http://4store.org/
      BigData    r4169      SYSTAP LLC    http://www.systap.com/bigdata.htm
      BigOwlim   3.4.3129   OntoText      http://www.ontotext.com/owlim/
      Jena TDB   0.8.9      openjena.org  http://www.openjena.org/TDB/
      Fuseki     0.1.0      openjena.org  http://openjena.org/wiki/Fuseki
      Virtuoso   7.0        OpenLink      http://virtuoso.openlinksw.com/
  • 8. Social Intelligence Benchmark
    • 14 dictionaries of real data
    • Facebook schema style
    • Realistic scenario simulation
    • Synthetic Generated Data
    • Linked Open Data
  • 9. Technical Work: Recycling (D2.4)
    Dynamic caching of intermediate query results
    • SPARQL problem: hard to index workload / expensive backward chaining
    Idea: compute once, re-use many times
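The "compute once, re-use many times" idea on the recycling slide can be sketched as a cache of intermediate results keyed by the sub-plan that produced them. The sketch below is only an illustration under stated assumptions, not the D2.4 recycler: the `Recycler` class, its `scan` and `join` methods, and the plan-key format are hypothetical, and it ignores what a real system has to handle, such as cache eviction and invalidation on updates.

```python
class Recycler:
    """Caches intermediate results, keyed by a canonical sub-plan description."""

    def __init__(self, triples):
        self.triples = triples
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def _memo(self, key, compute):
        # Re-use the stored intermediate if this sub-plan was seen before.
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = compute()
        return self.cache[key]

    def scan(self, predicate):
        """All (subject, object) pairs for one predicate."""
        return self._memo(
            ("scan", predicate),
            lambda: [(s, o) for s, p, o in self.triples if p == predicate],
        )

    def join(self, left_pred, right_pred):
        """Pairs (s, o2) such that s -left_pred-> x -right_pred-> o2."""
        def compute():
            by_subject = {}
            for s, o in self.scan(right_pred):
                by_subject.setdefault(s, []).append(o)
            return [(s, o2)
                    for s, x in self.scan(left_pred)
                    for o2 in by_subject.get(x, [])]
        return self._memo(("join", left_pred, right_pred), compute)


data = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("b", "livesIn", "Amsterdam"),
    ("c", "livesIn", "Vienna"),
]
recycler = Recycler(data)
friends_of_friends = recycler.join("knows", "knows")
friends_cities = recycler.join("knows", "livesIn")  # re-uses the cached "knows" scan
```

The second query never rescans the `knows` triples, and repeating either query returns the cached join directly; the `hits` and `misses` counters make that re-use visible.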
  • 10. Technical Work: Virtuoso 7
    Major upcoming release V7, due in 2012
    • column store technology:
      * aggressive compression → more data fits in RAM
      * vectored execution → things run faster
    • elastic cluster implementation
      * partitions can migrate across nodes
      * bringing computation to the data
      * arbitrary recursive functions in the cluster
    • geospatial support
      * full openGIS support, R-tree backed, EWKT format
    • future enhancements
      * adaptive query optimization (CWI ROX)
      * re-use of intermediates (CWI recycling)
      * using SSDs as cache
  • 11. Next 6 months
    Virtuoso: sampled query optimizer
    • query optimization in SPARQL is difficult (no stats)
    • use adaptive, run-time query optimization with sampling
    MonetDB and SPARQL
    • First version in sight (cooperation with FORTH)
    • research tracks
      * RDF clustering on Characteristic Sets
      * correlated join path indexing
    LOD cache at 250B triples
    • what triples to use?
    • what hardware to use? (need 512GB RAM)
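The sampling point on this slide (no precomputed statistics, so estimate at run time) can be illustrated with a small sketch: estimate how many triples each pattern matches from a random sample, then evaluate the most selective pattern first. This is a simplified stand-in, not the Virtuoso optimizer or CWI ROX; the function names, the `None`-as-variable pattern encoding, and the default sample size are assumptions for illustration.

```python
import random


def matches(triple, pattern):
    """A pattern is (s, p, o); None marks a variable that matches anything."""
    return all(want is None or want == got for want, got in zip(pattern, triple))


def estimate_cardinality(triples, pattern, sample_size, rng):
    """Scale the match count in a random sample up to the full data size."""
    if not triples:
        return 0.0
    if len(triples) <= sample_size:
        sample = triples  # small data: the estimate is exact
    else:
        sample = rng.sample(triples, sample_size)
    hits = sum(matches(t, pattern) for t in sample)
    return hits * len(triples) / len(sample)


def order_patterns(triples, patterns, sample_size=1000, seed=0):
    """Order triple patterns from most to least selective, by sampled estimate."""
    rng = random.Random(seed)
    estimates = {p: estimate_cardinality(triples, p, sample_size, rng)
                 for p in patterns}
    return sorted(patterns, key=lambda p: estimates[p])
```

For a query with patterns such as `(None, "type", None)` and `(None, "livesIn", None)`, `order_patterns` puts the rarer predicate first, so later joins start from the smallest intermediate result. A genuinely adaptive optimizer would also sample join results during execution and revise the order, which this sketch leaves out.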
  • 12. Contact
    Address: Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands
    monetdb.cwi.nl
    Thanks for your attention!
  • 13. LOD2 Benchmark Auditing Service
    Benchmarking needs of SPARQL engine vendors:
    • vendors want to publish on their own timescale
    • using new or upcoming releases (not yet public)
    • using settings and hardware properly tuned to their solution
    • yet need credibility (is it fair?)
    Tournaments organized by one institution have
    • bad timing, wrong version, one more bug to fix, etc.
    • not the right hardware or settings
    • may become a legal liability once matters become more serious
    LOD2 should reach out to the SPARQL technical community and provide independent benchmark auditing services
    • start with BSBM → working on Auditing Rules Document
    • maybe other benchmarks later