Assessing Linked Data Versioning Systems: The Semantic Publishing Versioning Benchmark

Irini Fundulaki
Vassilis Papakonstantinou and Giorgos Flouris
Institute of Computer Science
Foundation for Research and Technology
Greece

Versioning in the Web
•  The data and schema of Linked Open Datasets are constantly evolving;
dynamicity is an indispensable part of Linked Open Data (LOD)
•  Changes typically happen without any warning, centralized
monitoring, or reliable notification mechanism
•  Need to keep track of the different versions of the datasets to
ensure the quality and traceability of Web data

Semantic Web Technologies for Health Data Management 2018 2
Versioning: the creation and management of the changes (deletions, additions, modifications) made to a dataset
…

Benchmarking Versioning Systems
•  A versioning benchmark should test how different systems
behave with respect to
–  the space required by the multi-version repository and
–  the efficiency of retrieving different versions and answering
queries
•  Semantic Publishing Versioning Benchmark (SPVB)
–  scalable, fully configurable benchmark, independent of any
versioning strategy or system
–  produces realistic BBC data in conjunction with DBpedia data
–  follows a choke-point based design
•  choke points: the technical difficulties that force systems to
improve their performance
12th International Workshop on Scalable Semantic Web Knowledge Base Systems 3

LDBC Semantic Publishing Benchmark (SPB) 2.0
•  Inspired by Dynamic Semantic Publishing, continuously used at
BBC Sport
–  Synthetic, deterministic and scalable benchmark
–  Based on real BBC ontologies and on the DBpedia and GeoNames
ontologies
–  Generated datasets simulate the activity of a publishing
organization for a specific time period
–  Models 3 types of relations in data
•  Clustering of data
•  Correlations of entities
•  Random tagging of entities

SPVB: Choke-Point Based Design
•  VCP1 (Storage Space)
–  efficient management of storage space
•  VCP2 (Partial Version Reconstruction)
–  reconstruction of the part of a version required for query
answering
•  VCP3 (Parallel Version Reconstruction)
–  parallel version reconstruction for delta-based and hybrid
systems
•  VCP4 (Parallel Delta Computation)
–  parallel computation of deltas
•  VCP5 (On Delta Evaluation)
–  query evaluation for delta-based systems
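
The reconstruction choke points (VCP2, VCP3) can be pictured with a small Python sketch. It assumes a delta-based archive that keeps the initial version plus one change set (added and deleted triples) per later version. The names `ChangeSet` and `reconstruct`, and the sample triples, are hypothetical illustrations and not part of SPVB or of any benchmarked system.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class ChangeSet:
    """Changes between two consecutive versions (hypothetical layout)."""
    added: FrozenSet[Triple] = frozenset()
    deleted: FrozenSet[Triple] = frozenset()


def reconstruct(initial: Set[Triple], changes: List[ChangeSet], version: int) -> Set[Triple]:
    """Rebuild version `version` by replaying the first `version` change sets on V0."""
    if not 0 <= version <= len(changes):
        raise ValueError(f"version must be between 0 and {len(changes)}")
    triples = set(initial)  # copy, so V0 itself is never modified
    for change in changes[:version]:
        triples -= change.deleted
        triples |= change.added
    return triples


# Tiny example archive: V0 plus two change sets (made-up data).
v0 = {("cw1", "cwork:about", "dbpedia:David_Bowie")}
changes = [
    ChangeSet(added=frozenset({("cw1", "cwork:mentions", "dbpedia:Lou_Reed")})),
    ChangeSet(deleted=frozenset({("cw1", "cwork:about", "dbpedia:David_Bowie")})),
]
v2 = reconstruct(v0, changes, 2)
```

The cost of the replay loop grows with the number of change sets between V0 and the requested version, which is why partial and parallel reconstruction matter for delta-based and hybrid systems.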

SPVB: Key Performance Indicators
1.  Correctness
–  The proportion of SPARQL queries answered correctly
2.  Initial Version Ingestion Speed (in triples per second)
–  Number of triples that can be loaded per second for the initial
version
3.  Applied Changes Speed (in changes per second)
–  Average number of changes that can be stored per second
4.  Storage Space Cost (in MB)
–  Total storage space required for storing all versioned data
5.  Average Query Execution Time (in ms)
6.  Throughput (in queries per second)
–  Number of queries that can be answered per second, over all
query types
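
The rate-style KPIs above are simple ratios, which the following Python sketch spells out. The record layout `QueryRun` and all function names are hypothetical, and the throughput formula assumes queries are executed one after another; the benchmark's own implementation may measure these differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueryRun:
    """One executed query (hypothetical record layout)."""
    query_type: str    # e.g. "QT2"
    elapsed_ms: float  # measured execution time
    correct: bool      # True if the result matched the expected result


def rate_per_second(items: int, seconds: float) -> float:
    """Items per second; usable for ingestion speed and applied changes speed."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return items / seconds


def average_execution_time_ms(runs: List[QueryRun]) -> float:
    """Mean execution time over a non-empty list of runs."""
    return sum(r.elapsed_ms for r in runs) / len(runs)


def throughput_qps(runs: List[QueryRun]) -> float:
    """Queries per second, assuming sequential execution of a non-empty list."""
    total_seconds = sum(r.elapsed_ms for r in runs) / 1000.0
    return len(runs) / total_seconds


def correctness(runs: List[QueryRun]) -> float:
    """Proportion of runs whose result was correct."""
    return sum(1 for r in runs if r.correct) / len(runs)
```

For example, two runs of 100 ms and 300 ms, of which one is correct, give an average execution time of 200 ms, a throughput of about 5 queries per second and a correctness of 0.5.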

SPVB: Types of Versioning Queries
•  2 Dimensions:
–  Focus: refers to time - present (modern) or past (historical)
–  Type: refers to what we are querying
•  whole version (materialization), single-version, cross-version
Query types by focus:
•  Version
–  Modern: materialization, single-version structured queries
–  Historical: materialization, single-version structured queries
•  Delta: materialization, single-delta structured queries, cross-delta structured queries
•  Cross-version structured queries

SPVB: Query Types
Title: Explanation
•  QT1, Modern version materialization: queries ask for the full current version to be retrieved
•  QT2, Modern single-version structured queries: queries performed in the current version of the data
•  QT3, Historical version materialization: queries ask for a full past version
•  QT4, Historical single-version structured queries: queries performed in a single past version

SPVB: Query Types
Title: Explanation
•  QT5, Delta materialization: queries ask for a full delta between versions
•  QT6, Single-delta structured queries: queries performed on the changes between two consecutive versions
•  QT7, Cross-delta structured queries: queries performed on the changes across several versions
•  QT8, Cross-version structured queries: queries ask for information that appears in more than one version
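
To make the delta-focused query types concrete, here is a minimal Python sketch that treats a version as a set of triples. A delta (QT5) is then the pair of set differences between two versions, and a single-delta structured query (QT6) is a filter over that delta. The function names and sample triples are hypothetical and only serve as illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def materialize_delta(old: Set[Triple], new: Set[Triple]) -> Tuple[Set[Triple], Set[Triple]]:
    """Return (added, deleted) triples between two versions (QT5-style)."""
    return new - old, old - new


def added_with_predicate(old: Set[Triple], new: Set[Triple], predicate: str) -> Set[Triple]:
    """Triples added between two versions that use `predicate` (QT6-style)."""
    added, _ = materialize_delta(old, new)
    return {t for t in added if t[1] == predicate}


# Two made-up consecutive versions.
v1 = {
    ("cw1", "cwork:about", "dbpedia:David_Bowie"),
    ("cw2", "cwork:about", "dbpedia:Lou_Reed"),
}
v2 = {
    ("cw1", "cwork:about", "dbpedia:David_Bowie"),
    ("cw1", "cwork:mentions", "dbpedia:Lou_Reed"),
}
added, deleted = materialize_delta(v1, v2)
```

A system that stores full copies has to compute these differences at query time, while a delta-based system can read them directly; the reverse holds for the version-focused query types.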

SPVB: Architecture
[Figure: SPVB architecture. Components: Data Generator (SPB Data Generator, BBC ontologies, 5 DBpedia versions for 1000 BBC entities, Virtuoso triple store), Task Provider, Benchmarked System, Evaluation Storage, Evaluation Module. Labelled flows: generated data (versions V0, V1, …, Vn with added and deleted triples; DBpedia versions evenly distributed), SPARQL queries, expected results, results.]

SPVB: Data Generation (1)
•  Generation of versions that contain realistic data and real
DBpedia data
•  Generation of benchmarking tasks (SPARQL queries)
•  Computation of expected results
•  Configuration Parameters
1.  Data generation seed
2.  Initial version size
3.  Number of versions
4.  Version insertion ratio (%)
5.  Version deletion ratio (%)
6.  Generated data form: Independent Copies (IC), Delta or
ChangeSets (CS), Both (IC + CS)
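
As a rough illustration of how these parameters interact, the Python sketch below groups them in a configuration object and projects the size of each version. It assumes that the insertion and deletion ratios are percentages of the previous version's size and it ignores the DBpedia triples; both points are assumptions, and `GeneratorConfig` and `projected_sizes` are hypothetical names, not the benchmark's API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GeneratorConfig:
    """The six configuration parameters from the slide (hypothetical field names)."""
    seed: int
    initial_version_size: int   # triples in V0
    number_of_versions: int
    insertion_ratio_pct: float  # assumed: % of the previous version's size
    deletion_ratio_pct: float   # assumed: % of the previous version's size
    data_form: str = "IC"       # "IC", "CS" or "IC+CS"


def projected_sizes(cfg: GeneratorConfig) -> List[int]:
    """Approximate triple count of every version, V0 first."""
    sizes = [cfg.initial_version_size]
    for _ in range(cfg.number_of_versions - 1):
        prev = sizes[-1]
        added = math.floor(prev * cfg.insertion_ratio_pct / 100)
        deleted = math.floor(prev * cfg.deletion_ratio_pct / 100)
        sizes.append(prev + added - deleted)
    return sizes


cfg = GeneratorConfig(seed=100, initial_version_size=100_000,
                      number_of_versions=5, insertion_ratio_pct=10,
                      deletion_ratio_pct=5)
sizes = projected_sizes(cfg)
```

With a 10% insertion ratio and a 5% deletion ratio, each version in this sketch is about 5% larger than the previous one, so V1 holds 105,000 triples when V0 holds 100,000.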

SPVB: Generation of Synthetic, Realistic Data (2)
[Figure: SPVB architecture diagram, as on the “SPVB: Architecture” slide.]

SPVB: Generation of Synthetic, Realistic Data (3)
•  Use of the SPB data generator that produces RDF descriptions
of BBC creative works that store metadata about real entities
•  Generated datasets simulate the activity of a publishing
organization for a specific time period
•  Models 3 types of relations in data
–  Clustering of data
–  Correlations of entities
–  Random tagging of entities
[Figure: example creative work &cw1 with title and shortTitle values “David Bowie leads tribute to ‘master’ Lou Reed” and “David Bowie leads Lou Reed tribute”, linked through the about and mentions properties to dbpedia:David_Bowie and dbpedia:Lou_Reed.]

SPVB: DBpedia Data (4)
[Figure: SPVB architecture diagram, as on the “SPVB: Architecture” slide.]

SPVB: Generation of DBpedia Data (5)
•  Real DBpedia data
–  5 versions of DBpedia (2012 – 2016) integrated in SPVB
–  The 1000 most important entities (according to a score provided
by SPB) used for creative work annotation
–  The DBpedia subgraphs for those entities compose each version
–  The DBpedia versions are “equally distributed” over the total
number of produced versions
[Figure: creative work &cw1 linked to dbpedia:David_Bowie and dbpedia:Lou_Reed through the about and mentions properties.]
Integration of DBpedia versions & SPB datasets is done
by means of SPB Creative Works’ about & mentions
properties that are references to DBpedia
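
One plausible reading of “equally distributed” is that the 5 DBpedia versions are attached to evenly spaced positions among the generated versions. The Python sketch below implements that reading only; the spacing rule (first and last version always included, rounding half up in between) and the name `snapshot_positions` are assumptions, not the documented SPVB behaviour.

```python
from typing import List


def snapshot_positions(total_versions: int, snapshots: int = 5) -> List[int]:
    """Indices of the generated versions that receive a DBpedia snapshot.

    Assumed rule: spread `snapshots` positions evenly over
    0 .. total_versions - 1, rounding half up.
    """
    if snapshots < 1 or total_versions < snapshots:
        raise ValueError("need at least as many versions as snapshots")
    if snapshots == 1:
        return [0]
    last = total_versions - 1
    gaps = snapshots - 1
    # Integer form of floor(j * last / gaps + 0.5), avoiding float rounding.
    return [(2 * j * last + gaps) // (2 * gaps) for j in range(snapshots)]
```

With 5 generated versions every version receives one DBpedia snapshot; with 10 generated versions this rule places the snapshots at versions 0, 2, 5, 7 and 9.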

Task Generation (1)
[Figure: SPVB architecture diagram, as on the “SPVB: Architecture” slide.]

Task Generation (2)
•  Support for 8 Query Types (QT)
•  For each query type one or more query templates are defined,
based on SPB query templates
–  Each version is stored in a different named graph
–  Templates contain placeholders of the form {{{placeholder}}}
•  {{{graphVhistorical}}} refers to the queried version
•  {{{cwAboutUri}}} refers to an IRI from DBpedia

SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
?creativeWork cwork:about {{{cwAboutUri}}} .
{{{cwAboutUri}}} rdf:type ?v1 .
}

Task Generation (2)
•  For the query types QT2, QT4, QT8 we use 6 of the 25 DBpedia
SPARQL Benchmark (DBPSB) Query Templates
•  The templates selected do not return empty results when
considering the integrated DBpedia data


Task Generation (3)
•  Placeholder replacement:
–  queried version: a wide range of the available versions is covered
–  IRI from DBpedia:
•  same placeholders as those used in the DBPSB query templates
•  queries are produced by replacing placeholders with values/variables
•  one of the 1000 concrete values is picked at random

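Template:
SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
?creativeWork cwork:about {{{cwAboutUri}}} .
{{{cwAboutUri}}} rdf:type ?v1 .
}

After replacing the version placeholder:
SELECT DISTINCT ?creativeWork ?v1
FROM <http://graph.version.1>
WHERE {
?creativeWork cwork:about {{{cwAboutUri}}} .
{{{cwAboutUri}}} rdf:type ?v1 .
}

After replacing the IRI placeholder:
SELECT DISTINCT ?creativeWork ?v1
FROM <http://graph.version.1>
WHERE {
?creativeWork cwork:about dbpedia:David_Bowie .
dbpedia:David_Bowie rdf:type ?v1 .
}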
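
The replacement steps can be sketched in a few lines of Python. The template text, the placeholder names and the graph IRI pattern `http://graph.version.N` come from the slides; the functions `instantiate` and `make_query` and the two-element entity list (standing in for the 1000 DBpedia IRIs) are hypothetical.

```python
import random
import re
from typing import Dict, List

TEMPLATE = """SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
  ?creativeWork cwork:about {{{cwAboutUri}}} .
  {{{cwAboutUri}}} rdf:type ?v1 .
}"""

# Matches placeholders of the form {{{name}}}.
PLACEHOLDER = re.compile(r"\{\{\{(\w+)\}\}\}")


def instantiate(template: str, values: Dict[str, str]) -> str:
    """Replace every {{{name}}} with values[name]; fail on unknown names."""
    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name!r}")
        return values[name]
    return PLACEHOLDER.sub(substitute, template)


def make_query(template: str, version: int, entity_iris: List[str],
               rng: random.Random) -> str:
    """Pick one entity IRI at random and bind both placeholders."""
    return instantiate(template, {
        "graphVhistorical": f"<http://graph.version.{version}>",
        "cwAboutUri": rng.choice(entity_iris),
    })


# Stand-in for the 1000 DBpedia entity IRIs.
entities = ["dbpedia:David_Bowie", "dbpedia:Lou_Reed"]
query = make_query(TEMPLATE, 1, entities, random.Random(42))
```

Passing a seeded `random.Random` keeps the generated workload reproducible, in the spirit of the data generation seed among the configuration parameters.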

Experiments (1)
•  Benchmarked systems:
–  R43ples
–  Virtuoso
•  Experimental setup
–  3 datasets of different initial size
•  100K, 500K and 1M triples
–  5 different versions
–  Timeout of 1 hour
•  Baseline: Virtuoso with full materialization

Experiments (2): Virtuoso
•  Initial version ingestion speed is higher than the applied changes
speed
–  Overhead of the chosen versioning strategy (full materialization)
–  Unchanged information between versions is duplicated
•  The significant storage space overhead is due to the versioning
strategy used

Experiments (3): Virtuoso
•  Execution times are short relative to the data size, as all the
versions are already materialized in the triple store

Experiments (4): R43ples
•  Only managed to run experiments for the 100K triples dataset
•  Changes are applied more slowly than the triples of the initial version are loaded
–  The current version is kept materialized
•  Many queries failed to return the correct results
•  Response times are one or more orders of magnitude slower than Virtuoso
Metric                              Result
V0 ingestion speed (triples/sec)    3502.39
Changes speed (changes/sec)         2767.56
Storage cost (MB)                   197378
Throughput (queries/sec)            0.09

Per query type (several queries failed):
Query type    Execution time (ms)    Succeeded queries
QT1           13887.33               0/1
QT2           146.28                 25/30
QT3           18265.78               0/3
QT4           11681.49               13/18
QT5           31294.00               0/4
QT6           12299.58               4/4
QT7           35294.33               2/3
QT8           19177.33               30/36
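
The per-query-type counts can be aggregated into an overall share of succeeded queries. The short Python sketch below does this arithmetic, reading each table entry as succeeded/issued; that reading, and the helper name `success_share`, are assumptions.

```python
from typing import Dict, Tuple

# (succeeded, issued) per query type, copied from the R43ples results.
SUCCEEDED: Dict[str, Tuple[int, int]] = {
    "QT1": (0, 1), "QT2": (25, 30), "QT3": (0, 3), "QT4": (13, 18),
    "QT5": (0, 4), "QT6": (4, 4), "QT7": (2, 3), "QT8": (30, 36),
}


def success_share(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, int, float]:
    """Return (succeeded, issued, succeeded / issued) over all query types."""
    succeeded = sum(ok for ok, _ in counts.values())
    issued = sum(total for _, total in counts.values())
    return succeeded, issued, succeeded / issued


succeeded, issued, share = success_share(SUCCEEDED)
# Query types for which no query succeeded at all.
all_failed = [qt for qt, (ok, _) in SUCCEEDED.items() if ok == 0]
```

Under this reading, 74 of the 99 queries succeeded (roughly 75%), and the three query types with no successful query are the materialization ones: QT1, QT3 and QT5.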

Conclusions
•  Use multiple Data Generator components to parallelize data
generation
•  Make the query workload more configurable
–  Include or exclude specific query types
•  Visualize KPIs graphically
•  Experiment with a larger number of versioning systems

This work was supported by grants from the EU H2020 Framework Programme
provided for the project HOBBIT (GA no. 688227).
