This document discusses assessing and refining mappings to RDF to improve dataset quality. It introduces mapping quality assessment (MQA), which uses RDFUnit to identify violations in the mappings before the datasets are generated. Violations can then be fixed at their source, by refining the mappings, rather than patched in the dataset after publication. In the reported experiments on the large datasets (DBpedia, DBLP), MQA ran in seconds where dataset quality assessment took hours, and it reported far fewer, non-repeated violations.
Assessing and Refining Mappings to RDF to Improve Dataset Quality
1. Assessing and Refining Mappings to RDF to Improve Dataset Quality
Anastasia Dimou (1), Dimitris Kontokostas (2), Markus Freudenberg (2),
Ruben Verborgh (1), Jens Lehmann (2), Erik Mannens (1),
Sebastian Hellmann (2), Rik Van de Walle (1)
(1) Ghent University – iMinds – MMLab
(2) AKSW – Leipzig University
Anastasia.Dimou@UGent.be, @natadimou
Kontokostas@informatik.uni-leipzig.de, @jimkont
http://RML.io ● http://RDFUnit.aksw.org
2. Linked Open Data
semantically annotated using
different vocabularies or ontologies
and interlinked data representations
published in the form of RDF datasets
derived from originally heterogeneous
(semi-)structured data
3. RDF Dataset Quality
varies significantly ranging
from expensively curated
to relatively low quality datasets
4. RDF Dataset Quality - Intrinsic Dimension
determines the RDF Dataset Quality
by assessing it for possible violations
with respect to
accuracy (e.g. malformed datatype literals)
consistency (e.g. disjoint classes/properties)
7. Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
Schema: dbo:birthDate range xsd:date
Schema: dbo:birthDate domain dbo:Person
Example (from diagram): <http://example.com/Chuck_Bednarik> rdf:type dbo:Event ; dbo:birthDate "1925-05-01"^^xsd:gYear
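The two violations in this example can be sketched as a small check. This is a toy illustration, not how RDFUnit works internally: the record layout and the function name `find_schema_violations` are hypothetical, and the check ignores subclass reasoning (a real assessment would also accept subclasses of dbo:Person).

```python
# Toy schema taken from the slide: domain and range of dbo:birthDate.
SCHEMA = {
    "dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"},
}

# The example resource from the slide, flattened into a hypothetical record.
resource = {
    "subject": "http://example.com/Chuck_Bednarik",
    "types": {"dbo:Event"},
    # (predicate, lexical value, datatype)
    "values": [("dbo:birthDate", "1925-05-01", "xsd:gYear")],
}


def find_schema_violations(resource, schema):
    """Return (kind, predicate, expected, found) tuples for one resource."""
    violations = []
    for predicate, _lexical, datatype in resource["values"]:
        rule = schema.get(predicate)
        if rule is None:
            continue  # the toy schema says nothing about this property
        if rule["domain"] not in resource["types"]:
            violations.append(
                ("domain", predicate, rule["domain"], sorted(resource["types"]))
            )
        if datatype != rule["range"]:
            violations.append(("range", predicate, rule["range"], datatype))
    return violations


violations = find_schema_violations(resource, SCHEMA)
for violation in violations:
    print(violation)
```

The resource is typed dbo:Event although the domain is dbo:Person, and the literal is typed xsd:gYear although the range is xsd:date, so both checks fire.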
8. RDF DQA with RDFUnit
test-driven data-debugging framework
based on SPARQL-patterns
http://rdfunit.aksw.org
D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri
Test-driven evaluation of linked data quality
In Proceedings of the 23rd International Conference on World Wide Web
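As a rough sketch of what "based on SPARQL patterns" means: a generic query pattern with placeholders is bound to values taken from the schema, yielding one concrete test case per schema axiom. The pattern text, the `%%P1%%`/`%%D1%%` placeholder syntax, and the helper names below are assumptions made for illustration; RDFUnit's actual pattern library may differ in syntax and detail.

```python
# Illustrative pattern only: select resources whose value for property P1
# carries a datatype other than D1.
RANGE_DATATYPE_PATTERN = """
SELECT DISTINCT ?resource WHERE {
  ?resource %%P1%% ?value .
  FILTER (DATATYPE(?value) != %%D1%%)
}
""".strip()

PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(curie):
    """Turn a prefixed name such as dbo:birthDate into a full <IRI>."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def instantiate(pattern, bindings):
    """Bind every %%NAME%% placeholder to the expanded IRI of its value."""
    query = pattern
    for placeholder, curie in bindings.items():
        query = query.replace(f"%%{placeholder}%%", expand(curie))
    return query


# Test case derived from the axiom "dbo:birthDate range xsd:date".
test_case = instantiate(
    RANGE_DATATYPE_PATTERN, {"P1": "dbo:birthDate", "D1": "xsd:date"}
)
print(test_case)
```

Running the resulting query against a dataset would return every resource that violates the range axiom, which is why the same violation can be reported once per affected resource.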
11. Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
Similar violations occur repeatedly
within a single RDF dataset
15. Sets of triples of a dataset have repetitive patterns
Pattern (from diagram): <http://example.com/{Name}_{Surname}> rdf:type dbo:Event ; dbo:birthDate "Birth"^^xsd:gYear
Mapping languages
formalize patterns into rules
to generate the RDF dataset
from the original data
16. Instead of applying Quality Assessment
to the already published RDF dataset
as part of data consumption
Apply Quality Assessment to the Mappings
that generate the RDF dataset
Incorporate Quality Assessment
in the publishing workflow
17. DQA: Dataset Quality Assessment
is applied by third parties
to already published RDF datasets
(diagram: DQA reports violations on the published dataset)
18. DQA: Dataset Quality Assessment
Adjustments to the dataset
are applied manually, and rarely
are not applied at the root of the problem (hard to identify)
are overwritten when a new version of
the original data is mapped & published
(diagram: DQA reports violations on the published dataset)
21. Sets of triples of a dataset have repetitive patterns
Name      Surname     Birth
Chuck     Bednarik    1925-05-01
Matt      McBride     1985-05-23
Steve     Meilinger   1930-12-12
Brick     Bronsky     1964
Giddeon   Massie      1981-08-27
Pattern (from diagram): <http://example.com/{Name}_{Surname}> rdf:type dbo:Event ; dbo:birthDate "Birth"^^xsd:gYear
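The way one rule yields the same shape of triples for every row can be sketched as follows. This is a minimal illustration, not RML: the `RULE` dictionary, the function `generate_triples`, and the simplified xsd:gYear check (four or more digits, no timezone handling) are assumptions made for the example.

```python
import re

# Rows of the source table shown on the slide.
ROWS = [
    {"Name": "Chuck", "Surname": "Bednarik", "Birth": "1925-05-01"},
    {"Name": "Matt", "Surname": "McBride", "Birth": "1985-05-23"},
    {"Name": "Steve", "Surname": "Meilinger", "Birth": "1930-12-12"},
    {"Name": "Brick", "Surname": "Bronsky", "Birth": "1964"},
    {"Name": "Giddeon", "Surname": "Massie", "Birth": "1981-08-27"},
]

# Hypothetical, simplified stand-in for a mapping rule.
SUBJECT_TEMPLATE = "http://example.com/{Name}_{Surname}"
RULE = {
    "class": "dbo:Event",
    "predicate": "dbo:birthDate",
    "reference": "Birth",
    "datatype": "xsd:gYear",
}


def generate_triples(row):
    """Apply the single rule to one row, producing two triples."""
    subject = SUBJECT_TEMPLATE.format(**row)
    literal = f'"{row[RULE["reference"]]}"^^{RULE["datatype"]}'
    return [
        (subject, "rdf:type", RULE["class"]),
        (subject, RULE["predicate"], literal),
    ]


triples = [triple for row in ROWS for triple in generate_triples(row)]

# Simplified lexical check for xsd:gYear (assumption: timezone part ignored).
GYEAR = re.compile(r"-?\d{4,}")
malformed = [row for row in ROWS if not GYEAR.fullmatch(row["Birth"])]

print(len(triples), "triples generated")
print(len(malformed), "rows yield a malformed xsd:gYear literal")
```

Because every triple comes from the same rule, every generated resource is typed dbo:Event, and every full date is stamped xsd:gYear; a flaw in one rule is multiplied by the number of rows it is applied to.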
22. RDF Mapping Language (RML)
specifies the mapping definitions that
generate RDF representations
from heterogeneous data sources
extends the W3C-recommended R2RML
http://rml.io
A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle.
RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data.
In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), 2014.
32. MQA: Mapping Quality Assessment
discover violations before
they are even generated
specify the origin of the violation
easily apply structural adjustments
to the mapping definitions
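A minimal sketch of the idea, assuming the mapping is available as a plain dictionary rather than as RML: the schema check runs once against the mapping rule instead of once per generated resource, and the fix is applied to the rule itself. The `MAPPING` layout and the functions `assess_mapping` and `refine_mapping` are hypothetical, and aligning class and datatype with the schema is only one possible refinement.

```python
import copy

# Toy schema from the earlier slide.
SCHEMA = {"dbo:birthDate": {"domain": "dbo:Person", "range": "xsd:date"}}

# Hypothetical dictionary form of the mapping rule from the slides.
MAPPING = {
    "subject_template": "http://example.com/{Name}_{Surname}",
    "classes": ["dbo:Event"],
    "predicate_objects": [
        {"predicate": "dbo:birthDate", "reference": "Birth", "datatype": "xsd:gYear"},
    ],
}


def assess_mapping(mapping, schema):
    """Report schema violations found in the mapping itself (no data needed)."""
    findings = []
    for po in mapping["predicate_objects"]:
        rule = schema.get(po["predicate"])
        if rule is None:
            continue  # nothing known about this property
        if rule["domain"] not in mapping["classes"]:
            findings.append(("domain", po["predicate"], rule["domain"]))
        if po["datatype"] != rule["range"]:
            findings.append(("range", po["predicate"], rule["range"]))
    return findings


def refine_mapping(mapping, schema):
    """Return a copy of the mapping aligned with the schema (one possible fix)."""
    refined = copy.deepcopy(mapping)
    for po in refined["predicate_objects"]:
        rule = schema.get(po["predicate"])
        if rule is None:
            continue
        po["datatype"] = rule["range"]
        if rule["domain"] not in refined["classes"]:
            refined["classes"] = [rule["domain"]]
    return refined


findings = assess_mapping(MAPPING, SCHEMA)
refined = refine_mapping(MAPPING, SCHEMA)
print(findings)
print(assess_mapping(refined, SCHEMA))
```

Two findings are reported regardless of how many rows the source table has, and each finding points at the rule that causes it. Whether individual values (such as a bare year) are valid for the corrected datatype remains a question about the data, not the mapping.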
41. Beyond Mapping Quality Assessment
certain test cases inevitably
require the RDF Dataset
cardinality,
functionality,
symmetricity
these reflect the data and
are NOT affected by the mapping definitions
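Why such checks need the generated dataset can be seen in a small sketch of a functionality test. It assumes, purely for illustration, that dbo:birthDate is treated as functional (at most one value per resource); the second value for Chuck Bednarik is invented to trigger the violation, and `functional_violations` is a hypothetical helper.

```python
from collections import defaultdict

# Generated triples; the conflicting second birth date is a made-up example.
triples = [
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-01"),
    ("http://example.com/Chuck_Bednarik", "dbo:birthDate", "1925-05-02"),
    ("http://example.com/Matt_McBride", "dbo:birthDate", "1985-05-23"),
]


def functional_violations(triples, predicate):
    """Return subjects that have more than one distinct value for predicate."""
    values = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            values[subject].add(obj)
    return {s: sorted(v) for s, v in values.items() if len(v) > 1}


conflicts = functional_violations(triples, "dbo:birthDate")
print(conflicts)
```

The mapping rule is identical for both resources, so nothing in the mapping reveals the conflict; it only appears once the actual values are present.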
42. Mapping Quality Assessment (MQA)
prevents violations from being generated
prevents the same violations from appearing
repeatedly over distinct entities
allows intuitively combining
different ontologies and vocabularies
44. Dataset vs. Mapping Quality Assessment: Number of Violations
             Dataset Quality Assessment          Mapping Quality Assessment
Dataset      #failed test cases   #violations    #failed test cases   #violations
DBpedia EN   1,128                3.2M           1                    160
DBpedia NL   683                  815k           1                    124
DBLP         7                    8.1M           2                    8
* DBpedia and D2RQ mappings were translated to RML mappings
45. Dataset vs. Mapping Quality Assessment: Time
             Dataset Quality Assessment   Mapping Quality Assessment
Dataset      size     time                size     time
DBpedia EN   62M      16h                 115K     11s
DBpedia NL   21M      1.5h                53K      6s
DBLP         12M      12h                 368      12s
CEUR-WS*     2.4k     6s                  702      5s
iLastic      150k     12s                 825      15s
* CEUR-WS submission to the ESWC Semantic Publishing Challenge (2014 vs. 2015)
46. Mapping Quality Assessment
Dataset       size    time
DBpedia EN    115K    11s
DBpedia NL    53K     6s
DBpedia All   511K    32s
Live update of DBpedia Mapping Quality Assessment results every night:
http://mappings.dbpedia.org/validation
47. Violations
Most frequent violations are
related to the dataset's schema
(vocabularies or ontologies)
Similar violations occur repeatedly
within a single RDF dataset
The situation worsens the more
ontologies and vocabularies
are reused and combined
48. Quality Assessment
shifted from data consumption
to data publication
integrated systematically
in the publishing workflow
violations are identified,
resolved and will not re-appear
an RDF dataset of higher quality is generated