Data Quality Assessment
for Linked Data: A Survey
Amrapali Zaveri, Anisa Rula, Andrea Maurino,
Ricardo Pietrobon, Jens Lehmann, Sören Auer
Data Quality Tutorial, September 12, 2016
Outline
Survey Methodology

LDQ Dimensions and Metrics

LDQ Assessment Tools

LDQ In Practice
Survey Methodology — Steps I
• Related Surveys
• Research Questions
• Eligibility Criteria
• Search Strategy
• Title & Abstract Reviewing
Survey Methodology — Research Questions
• How can one assess the quality of Linked Data employing a
conceptual framework integrating prior approaches?

• What are the data quality problems that each approach assesses?

• Which are the data quality dimensions and metrics supported by
the proposed approaches?

• What kinds of tools are available for data quality assessment?
Survey Methodology — Eligibility Criteria
Inclusion criteria:
Must satisfy:
• published between 2002 and 2014
Should satisfy:
• data quality assessment
• trust assessment
• proposed and/or implemented an approach
• assessed the quality of LD or information systems based on LD
Exclusion criteria:
• not peer-reviewed
• published as a poster abstract
• data quality management
• other forms of structured data
• did not propose any methodology or framework
Survey Methodology — Steps
• Remove duplicates
• Further potential articles
• Compare short-listed articles
• Quantitative analysis
• Qualitative analysis
Survey Methodology — Results
• 30 core articles (Conference - 21, Journal - 8, Master's Thesis - 1)
• 18 Dimensions
• 69 Metrics
LDQ Dimensions & Metrics
• Data Quality: commonly conceived as a multi-dimensional construct, popularly defined as ‘fitness for use’*.
• Dimension: a characteristic of a dataset.
• Metric (or indicator): a procedure for measuring an information quality dimension.
*Juran et al., The Quality Control Handbook, 1974
18 LDQ Dimensions
LDQ Dimensions - Accessibility dimensions & metrics
• Availability - extent to which data (or some portion of it) is present, obtainable and ready for use
  • accessibility of the SPARQL endpoint and the server
  • dereferenceability of the URI
• Interlinking - degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources
  • detection of the existence and usage of external URIs
  • detection of all local in-links or back-links: all triples from a dataset that have the resource’s URI as the object
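The two interlinking metrics can be sketched over an in-memory list of triples. This is only an illustration: the sample triples, the `example.org` namespace and the helper names are assumptions, and treating any string that starts with `http` as a URI is a simplification.

```python
from urllib.parse import urlparse

# Hypothetical sample data: (subject, predicate, object) triples as plain strings.
TRIPLES = [
    ("http://example.org/r/Berlin", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/r/Germany", "http://example.org/p/capital",
     "http://example.org/r/Berlin"),
    ("http://example.org/r/Berlin", "http://example.org/p/country",
     "http://example.org/r/Germany"),
    ("http://example.org/r/Berlin", "http://www.w3.org/2000/01/rdf-schema#label",
     "Berlin"),
]


def is_uri(term):
    # Simplification: a real parser would distinguish URIs from literals by type.
    return term.startswith(("http://", "https://"))


def external_links(triples, local_host):
    """Triples whose object is a URI hosted outside the local namespace."""
    return [t for t in triples
            if is_uri(t[2]) and urlparse(t[2]).netloc != local_host]


def in_links(triples, resource_uri):
    """Triples that have resource_uri as their object (back-links)."""
    return [t for t in triples if t[2] == resource_uri]


external = external_links(TRIPLES, "example.org")
back = in_links(TRIPLES, "http://example.org/r/Berlin")
print(len(external), "external link(s),", len(back), "in-link(s)")
```

On a real dataset the triples would come from an RDF parser or a SPARQL query rather than a hard-coded list.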
LDQ Dimensions - Representational dimensions & metrics
• Interoperability - degree to which the format and structure of the information conform to previously returned information as well as data from other sources
  • detection of whether existing terms from all relevant vocabularies for that particular domain have been reused
  • usage of existing vocabularies for a particular domain
• Interpretability - refers to technical aspects of the data, that is, whether information is represented using an appropriate notation and whether the machine is able to process the data
  • detection of invalid usage of undefined classes and properties
  • detecting the use of appropriate language, symbols, units, datatypes and clear definitions
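Detecting the use of undefined classes and properties can be sketched as a set lookup. The vocabulary, the triples and the function name below are hypothetical; in practice the defined terms would be loaded from the ontologies the dataset declares.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical vocabulary: the classes and properties the ontology defines.
DEFINED_CLASSES = {"http://example.org/onto/City"}
DEFINED_PROPERTIES = {RDF_TYPE, "http://example.org/onto/population"}

# Hypothetical data containing one misspelled property and one undefined class.
TRIPLES = [
    ("http://example.org/r/A", RDF_TYPE, "http://example.org/onto/City"),
    ("http://example.org/r/A", "http://example.org/onto/population", "100"),
    ("http://example.org/r/A", "http://example.org/onto/mayorr", "http://example.org/r/X"),
    ("http://example.org/r/B", RDF_TYPE, "http://example.org/onto/Rivver"),
]


def undefined_terms(triples, classes, properties):
    """Return (undefined classes, undefined properties) used in the triples."""
    bad_properties = {p for _, p, _ in triples if p not in properties}
    bad_classes = {o for _, p, o in triples
                   if p == RDF_TYPE and o not in classes}
    return bad_classes, bad_properties


bad_classes, bad_properties = undefined_terms(
    TRIPLES, DEFINED_CLASSES, DEFINED_PROPERTIES)
print(sorted(bad_classes), sorted(bad_properties))
```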
LDQ Dimensions - Intrinsic dimensions & metrics
• Syntactic Validity - degree to which an RDF document conforms to the specification of the serialization format
  • detecting syntax errors using (i) validators, (ii) crowdsourcing
  • detecting syntactically inaccurate values by (i) use of explicit definition of the allowed values for a datatype, (ii) syntactic rules (type of characters allowed and/or the pattern of literal values)
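Syntactic rules on literal values can be sketched as one pattern per datatype. The patterns below are deliberately simplified assumptions, not the full XSD lexical spaces, and the sample literals are made up.

```python
import re

# Hypothetical, simplified rules: datatype -> pattern for the lexical form.
RULES = {
    "xsd:integer": re.compile(r"[+-]?\d+"),
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "xsd:boolean": re.compile(r"true|false|1|0"),
}


def invalid_literals(literals):
    """Return the (value, datatype) pairs whose value violates its datatype rule.

    Literals with a datatype that has no rule are skipped, not flagged.
    """
    return [(value, datatype) for value, datatype in literals
            if datatype in RULES and not RULES[datatype].fullmatch(value)]


LITERALS = [
    ("42", "xsd:integer"),
    ("4two", "xsd:integer"),
    ("2016-09-12", "xsd:date"),
    ("12/09/2016", "xsd:date"),
]

bad = invalid_literals(LITERALS)
print(bad)
```

A pattern check of this kind only covers the form of a value; it does not tell whether "2016-09-12" is the correct date for the entity it describes.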
LDQ Dimensions - Intrinsic dimensions & metrics
• Completeness
  • Schema - ontology completeness
    • no. of classes and properties represented / total no. of classes and properties
  • Property - missing values for a specific property
    • no. of values represented for a specific property / total no. of values for a specific property
  • Population - % of all real-world objects of a particular type that are represented in the dataset
  • Interlinking - degree to which instances in the dataset are interlinked
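The schema and property ratios above can be computed as a plain fraction of a reference set. The reference terms and the toy entities are invented for illustration; choosing the reference (a gold standard or the ontology itself) is the hard part and is not addressed here.

```python
def completeness(represented, reference):
    """Fraction of the reference items that are actually represented."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(set(represented) & reference) / len(reference)


# Schema completeness: classes and properties used vs. those in the ontology.
reference_terms = {"City", "Country", "River", "population", "capital"}
used_terms = {"City", "Country", "population"}
schema_score = completeness(used_terms, reference_terms)

# Property completeness: entities that have a value for one property.
# Toy values; None marks a missing value.
population = {"A": 100, "B": None, "C": 200, "D": None}
with_value = [name for name, value in population.items() if value is not None]
property_score = completeness(with_value, population.keys())

print(schema_score, property_score)
```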
LDQ Dimensions - Contextual dimensions & metrics
• Understandability - refers to the ease with which data can be comprehended without ambiguity and be used by a human information consumer
  • human-readable labelling of classes, properties and entities as well as presence of metadata
  • indication of the vocabularies used in the dataset
• Timeliness - measures how up-to-date data is relative to a specific task
  • freshness of datasets based on currency and volatility
  • freshness of datasets based on their data source
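One commonly used way to combine currency and volatility is max(0, 1 - currency / volatility), where currency is the age of the data and volatility is how long the data stays valid. The slide does not state a formula, so the sketch below is an assumption, and the dates are arbitrary.

```python
from datetime import datetime, timedelta


def timeliness(last_modified, now, volatility):
    """Score in [0, 1]: 1 means just updated, 0 means older than its validity span.

    Assumed formulation: max(0, 1 - currency / volatility).
    """
    if volatility <= timedelta(0):
        raise ValueError("volatility must be positive")
    currency = now - last_modified          # age of the data
    return max(0.0, 1.0 - currency / volatility)


now = datetime(2016, 9, 12)
last_modified = datetime(2016, 9, 2)        # data is 10 days old

fresh = timeliness(last_modified, now, timedelta(days=40))
stale = timeliness(last_modified, now, timedelta(days=5))
print(fresh, stale)
```

The same 10-day-old data scores well for a task that tolerates 40 days of age and drops to zero for a task that needs data no older than 5 days, which is the sense in which timeliness is relative to a specific task.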
LDQ Assessment Tools
LDQ Assessment Tools - RDFUnit
http://aksw.org/Projects/RDFUnit.html
• Syntactic Validity
• Semantic Accuracy
• Consistency
LDQ Assessment Tools - Dacura
http://dacura.cs.tcd.ie/about-dacura/
• Interpretability
• Semantic Accuracy
• Consistency
Linked Data Quality — In Practice
• Methodologies
• Tools
• Use Cases
• Beyond Data
• Vocabulary
Crowdsourcing Linked Data Quality Assessment
LDQ Assessment Tools — Luzzu
http://eis-bonn.github.io/Luzzu/index.html
1 Metric
2 Assess
3 Clean
4 Store
5 Rank
LDQ Assessment Tools — LODLaundromat
http://lodlaundromat.org/
LDQ Use Cases — Open Data Portals
Neumaier et al. Automated Quality Assessment of Metadata across Open Data Portals. JDIQ 2016.
• Completeness
• Interoperability
• Relevancy
• Accuracy
• Openness
LDQ Beyond Data — Mapping Quality
Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality.
ISWC 2015.
https://github.com/RMLio/RML-Validator
W3C Data Quality Vocabulary
https://www.w3.org/TR/vocab-dqv/
Classes:
• dqv:Category
• dqv:Dimension
• dqv:Metric
• dqv:QualityMeasurement (subclass of qb:Observation)
• dqv:QualityMeasurementDataset (subclass of qb:DataSet)
Properties:
• dqv:inCategory
• dqv:inDimension
• dqv:isMeasurementOf
• dqv:hasQualityMeasurement
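How these DQV terms fit together can be sketched by emitting the triples for a single measurement. The `http://example.org/quality/` resources and the 0.75 score are hypothetical. `dqv:value` does not appear on the slide; it is the property the DQV specification uses to carry the measured value.

```python
DQV = "http://www.w3.org/ns/dqv#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "http://example.org/quality/"  # hypothetical namespace for this sketch


def uri(value):
    return f"<{value}>"


def measurement_triples(dataset, measurement, metric, dimension, category, value):
    """Triples linking a dataset to one quality measurement and its metric."""
    return [
        (uri(category), uri(RDF_TYPE), uri(DQV + "Category")),
        (uri(dimension), uri(RDF_TYPE), uri(DQV + "Dimension")),
        (uri(dimension), uri(DQV + "inCategory"), uri(category)),
        (uri(metric), uri(RDF_TYPE), uri(DQV + "Metric")),
        (uri(metric), uri(DQV + "inDimension"), uri(dimension)),
        (uri(measurement), uri(RDF_TYPE), uri(DQV + "QualityMeasurement")),
        (uri(measurement), uri(DQV + "isMeasurementOf"), uri(metric)),
        (uri(measurement), uri(DQV + "value"), f'"{value}"^^<{XSD_DOUBLE}>'),
        (uri(dataset), uri(DQV + "hasQualityMeasurement"), uri(measurement)),
    ]


triples = measurement_triples(
    dataset=EX + "myDataset",
    measurement=EX + "measurement1",
    metric=EX + "propertyCompletenessMetric",
    dimension=EX + "completeness",
    category=EX + "intrinsic",
    value=0.75,
)

# One N-Triples statement per line.
ntriples = "\n".join(" ".join(t) + " ." for t in triples)
print(ntriples)
```

An RDF library would normally handle serialization; plain strings are used here only to keep the sketch dependency-free.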
Challenges
• Propagation of errors

• Management/Improvement

• Usage of the standard vocabulary

• Quality-based search engines
Thank you!

Questions?
amrapali@stanford.edu

@AmrapaliZ
Quality assessment for linked data: A survey
A Zaveri, A Rula, A Maurino, R Pietrobon, J Lehmann, S Auer
Semantic Web 7 (1), 63-93
