Assessing Linked Data Versioning Systems: The Semantic Publishing Versioning Benchmark
Irini	Fundulaki	
Vassilis	Papakonstantinou	and	Giorgos	Flouris	
Institute	of	Computer	Science	
Foundation	for	Research	and	Technology	
Greece	
Versioning	in	the	Web	
•  The data and schema of Linked Open Datasets are constantly evolving,
with dynamicity being an indispensable part of LOD
•  Changes	typically	happen	without	any	warning,	centralized	
monitoring,	or	reliable	notification	mechanism	
•  Need	to	keep	track	of	the	different	versions	of	the	datasets	to	
ensure	the	quality	and	traceability	of	Web	data	
	
Semantic	Web	Technologies	for	Health	Data	Management	2018	 2	
Versioning:		creation	and	management	of	the			
changes	(deletion,	addition,	modification)		
of	a	dataset	
…
Benchmarking	Versioning	Systems	
•  Versioning	Benchmark	should	test	how	different	systems	
behave	with	respect	to		
–  the	space	required	by	the	multi-version	repository	and	
–  the	efficiency	of	retrieving	different	versions	and	answering	
queries		
•  Semantic	Publishing	Versioning	Benchmark	(SPVB)	
–  scalable	benchmark,	fully	configurable,		independent	of	any	
versioning	strategy	or	system	
–  Produces	realistic	BBC	data	in	conjunction	with	DBpedia	data.		
–  follows a choke-point based design
•  choke points: the set of technical difficulties that force systems to improve their performance
12th	International	Workshop	on	Scalable	Semantic	Web	Knowledge	Base	Systems	 3
LDBC	Semantic	Publishing	Benchmark	(SPB)	2.0	
•  Inspired	by	Dynamic	Semantic	Publishing,	continuously	used	at	
BBC	Sport	
–  Synthetic,	deterministic		and	scalable	benchmark		
–  Based	on	real	BBC	ontologies,	DBpedia	and	Geonames	
ontologies		
–  Generated	datasets	simulate	the	activity	of	a	publishing	
organization	for	a	specific	time	period	
–  Models	3	types	of	relations	in	data	
•  Clustering	of	data	
•  Correlations	of	entities		
•  Random	tagging	of	entities	
SPVB:	Choke-Point	Based	Design	
•  VCP1	(Storage	Space)	
–  efficient	management	of	storage	space	
•  VCP2	(Partial	Version	Reconstruction)	
–  reconstruction	of	the	part	of	a	version	required	for	query	
answering	
•  VCP3	(Parallel	Version	Reconstruction)	
–  parallel	version	reconstruction	for	delta-based	and	hybrid	
systems		
•  VCP4	(Parallel	Delta	Computation)	
–  parallel	computation	of	deltas	
•  VCP5	(On	Delta	Evaluation)	
–  query	evaluation	for	delta-based	systems	
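
The delta-based strategy that VCP2 to VCP5 refer to can be illustrated with a small sketch. It is a toy model under stated assumptions, not code from SPVB or from any benchmarked system: triples are plain Python tuples, each change set is a hypothetical (added, deleted) pair of sets, and reconstruct_version is an illustrative name.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
# Hypothetical change-set layout: (added triples, deleted triples).
ChangeSet = Tuple[Set[Triple], Set[Triple]]


def reconstruct_version(v0: Set[Triple], deltas: List[ChangeSet], target: int) -> Set[Triple]:
    """Rebuild version `target` by replaying the first `target` change sets on top of V0."""
    if not 0 <= target <= len(deltas):
        raise ValueError("target version out of range")
    current = set(v0)  # copy, so the stored V0 is never mutated
    for added, deleted in deltas[:target]:
        current -= deleted
        current |= added
    return current


# Illustrative data only.
v0 = {("cw1", "about", "dbpedia:David_Bowie")}
deltas = [
    ({("cw1", "mentions", "dbpedia:Lou_Reed")}, set()),
    ({("cw2", "about", "dbpedia:Lou_Reed")}, {("cw1", "about", "dbpedia:David_Bowie")}),
]
v2 = reconstruct_version(v0, deltas, 2)
```

In a model like this, storing only V0 and the change sets keeps storage small, but every historical query pays the replay cost, which is presumably why partial and parallel reconstruction appear as separate choke points.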
SPVB:	Key	Performance	Indicators	
1.  Correctness	
–  The	proportion	of	SPARQL	queries	answered	correctly		
2.  Initial	Version	Ingestion	Speed	(triples	per	second)	
–  Number	of	triples	that	can	be	loaded	per	second	for	the	initial	
version	
3.  Applied	Changes	Speed	(in	changes	per	second)	
–  Average	number	of	changes	that	can	be	stored	per	second		
4.  Storage	space	cost	(in	MB)		
–  Total	storage	space	required	for	storing	all	versioned	data	
5.  Average	Query	Execution	Time	(in	ms)	
6.  Throughput	(queries	per	second)	
–  Measures	the	number	of	queries	that	can	be	answered	per	
second	for	all	query	types	
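
As a rough illustration of how the numeric KPIs above could be derived from raw measurements, consider the following sketch. The QueryRun record and all function names are hypothetical, and the throughput formula assumes that queries are executed one after another; the benchmark itself may measure these values differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryRun:
    """One executed query: its duration and whether it matched the expected result."""
    millis: float
    correct: bool


def ingestion_speed(triples: int, seconds: float) -> float:
    """Triples (or changes) stored per second."""
    return triples / seconds


def correctness(runs: List[QueryRun]) -> float:
    """Proportion of queries answered correctly."""
    return sum(r.correct for r in runs) / len(runs)


def average_execution_time(runs: List[QueryRun]) -> float:
    """Mean query execution time in milliseconds."""
    return sum(r.millis for r in runs) / len(runs)


def throughput(runs: List[QueryRun]) -> float:
    """Queries per second, assuming sequential execution (an assumption of this sketch)."""
    total_seconds = sum(r.millis for r in runs) / 1000.0
    return len(runs) / total_seconds


# Made-up measurements, for illustration only.
runs = [QueryRun(100.0, True), QueryRun(300.0, True), QueryRun(600.0, False), QueryRun(1000.0, True)]
```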
SPVB:	Types	of	Versioning	Queries	
•  2	Dimensions:	
–  Focus:	refers	to	time	-	present	(modern)	or	past	(historical)	
–  Type:	refers	to	what	we	are	querying	
•  whole	version	(materialization),	single-version,	cross-version	
[Figure: taxonomy of versioning queries by focus]
•  Version
–  Modern: materialization, single-version structured queries
–  Historical: materialization, single-version structured queries
–  Cross-version structured queries
•  Delta
–  Materialization
–  Single-delta structured queries
–  Cross-delta structured queries
SPVB:	Query	Types	
Type  Title                                         Explanation
QT1   Modern version materialization                Queries ask for the full current version to be retrieved
QT2   Modern single-version structured queries      Queries performed in the current version of the data
QT3   Historical version materialization            Queries ask for a full past version
QT4   Historical single-version structured queries  Queries performed in a single past version
SPVB:	Query	Types		
Type  Title                               Explanation
QT5   Delta materialization               Queries ask for a full delta between versions
QT6   Single-delta structured queries     Queries performed on the changes between two consecutive versions
QT7   Cross-delta structured queries      Queries performed on the changes across several versions
QT8   Cross-version structured queries    Queries ask for information that appears in more than one version
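
To make the delta and cross-version query types more concrete, here is a minimal in-memory sketch. It assumes each version is a Python set of triples; the function names delta and in_all_versions are hypothetical, and in_all_versions is only one possible example of a cross-version question.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def delta(old: Set[Triple], new: Set[Triple]) -> Dict[str, Set[Triple]]:
    """QT5-style delta materialization: triples added and deleted between two versions."""
    return {"added": new - old, "deleted": old - new}


def in_all_versions(versions: List[Set[Triple]]) -> Set[Triple]:
    """A QT8-style question: which triples appear in every one of the given versions."""
    return set.intersection(*versions) if versions else set()


# Illustrative data only.
a = ("cw1", "about", "dbpedia:David_Bowie")
b = ("cw1", "mentions", "dbpedia:Lou_Reed")
c = ("cw2", "about", "dbpedia:Lou_Reed")
v1 = {a, b}
v2 = {b, c}
changes = delta(v1, v2)
shared = in_all_versions([v1, v2])
```

A structured query over a delta (QT6, QT7) would then filter the added or deleted sets instead of a full version.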
SPVB:	Architecture	
[Figure: SPVB architecture. Components: SPB Data Generator, BBC Ontologies, 5 DBpedia versions for 1000 BBC entities (V0 to V4, evenly distributed), Virtuoso Triple Store, Data Generator, Task Provider, Benchmarked System, Evaluation Storage, Evaluation Module. Versions V0, V1, …, Vn are built from added and deleted triples. Labeled flows: generated data, SPARQL queries, expected results, results.]
SPVB:	Data	Generation	(1)	
•  Generation	of	versions	that	contain	realistic	data	and	real	
DBpedia	data	
•  Generation	of	benchmarking	tasks	(SPARQL	queries)	
•  Computation	of	expected	results	
•  Configuration	Parameters	
1.  Data	generation	seed	
2.  Initial	version	size	
3.  Number	of	versions	
4.  Version	insertion	ratio	(%)	
5.  Version	deletion	ratio	(%)	
6.  Generated	data	form:	Independent	Copies	(IC),	Delta	or	
ChangeSets	(CS),	Both	(IC	+	CS)	
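
The configuration parameters can be pictured as a small record, sketched below. The field names, the example values, and the projected_sizes helper are hypothetical. The sketch also assumes that the insertion and deletion ratios are percentages of the previous version's size, which the slides do not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GeneratorConfig:
    seed: int                  # data generation seed
    initial_version_size: int  # triples in V0
    number_of_versions: int
    insertion_ratio: float     # percent of the previous version (assumed meaning)
    deletion_ratio: float      # percent of the previous version (assumed meaning)
    data_form: str = "IC"      # "IC", "CS" or "IC+CS"


def projected_sizes(cfg: GeneratorConfig) -> List[int]:
    """Estimate the size of each version under the assumption stated above."""
    sizes = [cfg.initial_version_size]
    for _ in range(cfg.number_of_versions - 1):
        prev = sizes[-1]
        added = round(prev * cfg.insertion_ratio / 100)
        deleted = round(prev * cfg.deletion_ratio / 100)
        sizes.append(prev + added - deleted)
    return sizes


# Example values chosen for illustration; they are not defaults of the benchmark.
cfg = GeneratorConfig(seed=100, initial_version_size=100_000, number_of_versions=5,
                      insertion_ratio=15.0, deletion_ratio=10.0)
sizes = projected_sizes(cfg)
```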
SPVB:	Generation	of	Synthetic,	Realistic	Data	(2)	
[Figure: SPVB architecture diagram, repeated from the Architecture slide.]
SPVB:	Generation	of	Synthetic,	Realistic	Data	(3)	
•  Use	of	the	SPB	data	generator	that	produces	RDF	descriptions	
of	BBC	creative	works	that	store	metadata	about	real	entities	
•  Generated	datasets	simulate	the	activity	of	a	publishing	
organization	for	a	specific	time	period	
•  Models	3	types	of	relations	in	data	
–  Clustering	of	data	
–  Correlations	of	entities		
–  Random	tagging	of	entities	
[Figure: example creative work &cw1 with title “David Bowie leads tribute to ‘master’ Lou Reed” and shortTitle “David Bowie leads Lou Reed tribute”, linked to dbpedia:David_Bowie and dbpedia:Lou_Reed through the about and mentions properties.]
SPVB:	DBpedia	Data	(4)	
[Figure: SPVB architecture diagram, repeated from the Architecture slide.]
SPVB:	Generation	of	DBpedia	Data	(5)	
•  Real	DBpedia	data	
–  5	versions	of	DBpedia	(2012	–	2016)	integrated	in	SPVB	
–  The	1000	most	important	entities	(according	to	a	score	provided	
by	SPB)	used	for	creative	work		annotation	
–  The	DBpedia	subgraphs	for	those	entities	compose	each	version	
–  DBpedia versions are “equally distributed” across the total number of produced versions
[Figure: creative work &cw1 linked to dbpedia:David_Bowie and dbpedia:Lou_Reed through the about and mentions properties.]
Integration	of	DBpedia	versions	&	SPB	datasets	is	done	
by	means	of	SPB	Creative	Works’	about	&	mentions	
properties	that	are	references	to	DBpedia
Task	Generation	(1)	
[Figure: SPVB architecture diagram, repeated from the Architecture slide.]
Task	Generation	(2)	
•  Support	for	8	Query	Types	(QT)	
•  For	each	query	type	one	or	more	query	templates	are	defined	
based	on	SPB	query	templates	
–  Each	version	is	stored	in	a	different	named	graph	
–  Each template contains placeholders of the form {{{placeholder}}}
•  {{{graphVhistorical}}} refers to the queried version
•  {{{cwAboutUri}}} refers to an IRI from DBpedia

SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
  ?creativeWork cwork:about {{{cwAboutUri}}} .
  {{{cwAboutUri}}} rdf:type ?v1 .
}
Task	Generation	(2)		
•  For	the	query	types	QT2,	QT4,	QT8	we	use	6	of	the	25	DBpedia	
SPARQL	Benchmark	(DBPSB)	Query	Templates	
•  The	templates	selected	do	not	return	empty	results	when	
considering	the	integrated	DBpedia	data	
Task	Generation	(3)		
•  Placeholder replacement:
–  Queried version: a wide range of the available versions is covered
•  IRI from DBpedia:
–  The same placeholders are used as in the DBPSB query templates
–  Queries are produced by replacing placeholders with values/variables
–  One of 1000 concrete values is picked at random

SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
  ?creativeWork cwork:about {{{cwAboutUri}}} .
  {{{cwAboutUri}}} rdf:type ?v1 .
}

SELECT DISTINCT ?creativeWork ?v1
FROM <http://graph.version.1>
WHERE {
  ?creativeWork cwork:about {{{cwAboutUri}}} .
  {{{cwAboutUri}}} rdf:type ?v1 .
}

SELECT DISTINCT ?creativeWork ?v1
FROM <http://graph.version.1>
WHERE {
  ?creativeWork cwork:about dbpedia:David_Bowie .
  dbpedia:David_Bowie rdf:type ?v1 .
}
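
The two replacement steps shown above can be sketched in a few lines of Python. This is an illustration under assumptions, not the benchmark's task generator: the instantiate function is hypothetical, the graph IRI pattern is copied from the example query, and a seeded random generator stands in for the random pick among the 1000 entities.

```python
import random
from typing import List

TEMPLATE = """SELECT DISTINCT ?creativeWork ?v1
FROM {{{graphVhistorical}}}
WHERE {
  ?creativeWork cwork:about {{{cwAboutUri}}} .
  {{{cwAboutUri}}} rdf:type ?v1 .
}"""


def instantiate(template: str, version: int, entities: List[str], rng: random.Random) -> str:
    """Fill in the version graph first, then one randomly chosen DBpedia IRI."""
    graph = f"<http://graph.version.{version}>"  # IRI pattern taken from the slide example
    entity = rng.choice(entities)
    return (template
            .replace("{{{graphVhistorical}}}", graph)
            .replace("{{{cwAboutUri}}}", entity))


# Two entities stand in for the 1000 used by the benchmark.
entities = ["dbpedia:David_Bowie", "dbpedia:Lou_Reed"]
query = instantiate(TEMPLATE, 1, entities, random.Random(42))
```

Passing an explicit random.Random instance keeps the generated workload reproducible for a given seed, in the same spirit as the data generation seed parameter.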
Experiments	(1)		
•  Benchmarked	systems:	
–  R43ples	
–  Virtuoso	
•  Experimental	setup	
–  3	datasets	of	different	initial	size	
•  100K, 500K, and 1M triples
–  5	different	versions	
–  Timeout	of	1	hour		
•  Baseline:	Virtuoso	with	full	materialization		
Experiments	(2):	Virtuoso	
•  Initial	version	ingestion	speed	outperforms	the	applied	changes	
speed	
–  Overhead	of	the	chosen	versioning	strategy	(full	materialization)	
–  Unchanged	information	between	versions	is	duplicated	
•  Significant	overhead	on	storage	space	is	due	to	the	versioning	
strategy	used	
Experiments	(3):	Virtuoso	
•  Execution	times	are	short	(based	on	the	data	size),	as	all	the	
versions	are	already	materialized	in	the	triple	store	
Experiments	(4):	R43ples	
•  Only	managed	to	run	experiments	for	the	100K	triples	dataset	
•  Changes are applied more slowly than the triples of the initial version are loaded
–  Current	version	kept	materialized	
•  Many	queries	failed	to	return	the	correct	results	
•  Response	times	are	order(s)	of	magnitude	slower	than	Virtuoso	
Metric                            Result
V0 ingestion speed (triples/sec)  3502.39
Changes speed (changes/sec)       2767.56
Storage cost (MB)                 197378
Throughput (queries/sec)          0.09

Queries failed

Metric    Result    Succeeded queries
QT1 (ms)  13887.33  0/1
QT2 (ms)  146.28    25/30
QT3 (ms)  18265.78  0/3
QT4 (ms)  11681.49  13/18
QT5 (ms)  31294.00  0/4
QT6 (ms)  12299.58  4/4
QT7 (ms)  35294.33  2/3
QT8 (ms)  19177.33  30/36
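
The succeeded-query counts in the table can be aggregated into a single figure, as the short sketch below shows. It assumes that pooling all query types is a meaningful way to summarize correctness; the slides do not report such an overall number, so it should be read as a derived illustration only.

```python
# (succeeded, total) per query type, copied from the R43ples results table.
succeeded = {
    "QT1": (0, 1), "QT2": (25, 30), "QT3": (0, 3), "QT4": (13, 18),
    "QT5": (0, 4), "QT6": (4, 4), "QT7": (2, 3), "QT8": (30, 36),
}

ok = sum(s for s, _ in succeeded.values())
total = sum(t for _, t in succeeded.values())
overall = ok / total  # pooled proportion across all query types

print(f"{ok}/{total} queries succeeded ({overall:.1%})")  # 74/99 queries succeeded (74.7%)
```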
Conclusions	
•  Using	multiple	Data	Generator	components	to	parallelize	data	
generation	
•  Make	query	workload	more	configurable	
•  Include	or	exclude	specific	query	types	
•  Graphically	visualize	KPIs	
•  Experiment	with	larger	number	of	versioning	systems	
This work was supported by grants from the EU H2020 Framework Programme
provided for the project HOBBIT (GA no. 688227).
