Structural and syntactic metrics for RDF datasets that correlate with high-level quality deficiencies.
The vision of the Linked Open Data (LOD) initiative is to provide a model for publishing data and meaningfully interlinking such dispersed but related data. Despite the importance of data quality for the successful growth of the LOD, only limited attention has been paid to the quality of data prior to its publication on the LOD. This paper focuses on the systematic assessment of the quality of datasets before they are published on the LOD cloud. To this end, we identify important quality deficiencies that need to be avoided or resolved prior to the publication of a dataset, and we propose a set of metrics to measure and identify these quality deficiencies. In this way, our metrics enable the assessment and identification of undesirable quality characteristics of a dataset.
Slides for paper presentation at DEXA 2015:
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri:
Quality Metrics for Linked Open Data. DEXA (1) 2015: 144-152
1. Quality Metrics for Linked Open Data
Behshid Behkamal, Mohsen Kahani, Ebrahim Bagheri
Bagheri@ryerson.ca
2. Outline
• What is the problem?
• What have others done?
• What is our solution?
• Does it work?
3. What is the problem?
Linked Open Data (LOD): realizing the Semantic Web by interlinking existing but dispersed data
Main components of LOD:
• URIs to identify things
• RDF to describe data
• HTTP to access data
4. What is the problem?
Inclusion criteria for publishing and interlinking datasets into the LOD cloud:
• Resolvable http/https URIs
• Presented in one of the standard Semantic Web formats (RDF/XML, RDFa, Turtle, N-Triples)
• Contains at least 1,000 triples
• Connected via at least 50 RDF links to existing datasets of the LOD
• Accessible via RDF crawling, an RDF dump, or a SPARQL endpoint (the two countable criteria are sketched in code below)
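Purely as an illustration (not part of the paper), the two countable criteria can be checked on a local RDF dump with rdflib; the function and the local_prefix parameter (the dataset's own namespace, used to decide which links point to other datasets) are our assumptions:

```python
# Sketch: checking the two mechanically countable LOD inclusion criteria
# on a local RDF dump. local_prefix is the dataset's own namespace and is
# an assumption used to approximate "RDF links to other LOD datasets".
from rdflib import Graph, URIRef

def check_inclusion_criteria(dump_path: str, local_prefix: str) -> dict:
    g = Graph()
    g.parse(dump_path)  # rdflib infers the format from the file extension

    # Triples whose object is a URI outside the dataset's own namespace.
    external_links = sum(
        1 for _, _, o in g
        if isinstance(o, URIRef) and not str(o).startswith(local_prefix)
    )
    return {
        "at_least_1000_triples": len(g) >= 1000,
        "at_least_50_external_rdf_links": external_links >= 50,
    }
```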
Is the dataset ready to be published?
5. What is the problem?
One approach: publish first, improve later
Results in: quality problems in the published datasets
Missing link: data quality evaluation prior to release
6. What have others done?
Data quality in the context of LOD:
Validators:
• General validators
• Parsing and syntax
• Accessibility / dereferenceability
Quality assessment of published data:
• Classifying quality problems of LOD
• Using metadata for quality assessment
• Filtering poor-quality data (WIQA)
• Semantic annotation using ontologies
7. Our Contribution
• Identifying important quality deficiencies that need to be avoided or resolved prior to release
• Proposing a set of metrics to measure and identify these quality deficiencies in a dataset
8. Quality Deficiencies (Schema level), with resolution method in parentheses
Improper usage of vocabularies:
• Not using appropriate existing vocabularies to describe the resources (domain expert)
Redefining existing classes/properties:
• Redefining classes/properties in the ontology that already exist in the vocabularies (domain expert)
Improper definition of classes/properties:
• Classes with different names but the same relations (semi-automated)
• Properties with different names but the same meaning (ontologist)
• Inadequate number of classes/properties used to describe the resources (domain expert)
Misuse of data type:
• Not using appropriate data types for the literals (automated)
9. Quality Deficiencies (Instance level), with resolution method in parentheses
Errors in property values:
• Missing values (automated)
• Out-of-range values (automated)
• Misspelling (semi-automated)
• Inconsistent values (automated)
Mismatch with the real world:
• Resources without correspondence in the real world (domain expert)
Syntax errors:
• Triples containing syntax errors (validator)
Misuse of data type / object property:
• Improper assignment of an object property to a data type property or vice versa (validator)
Improper usage of classes/properties:
• Using undefined classes/properties (semi-automated)
• Membership of disjoint classes (automated; see the query sketch below)
• Misplaced classes/properties (validator)
Redundant/similar individuals:
• Individuals with similar property values but different names (ontologist)
Invalid usage of inverse-functional properties:
• Inverse-functional properties with void values (automated)
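As a concrete illustration of one of the automatable issues above, instances that are members of disjoint classes can be found with a SPARQL query over an rdflib graph, assuming the owl:disjointWith axioms are loaded together with the instance data (a sketch, not the authors' implementation):

```python
# Sketch: detecting membership of disjoint classes with SPARQL. The
# query assumes disjointness axioms and instance data share one graph.
from rdflib import Graph

DISJOINT_MEMBERS = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?x ?a ?b WHERE {
    ?a owl:disjointWith ?b .
    ?x rdf:type ?a .
    ?x rdf:type ?b .
}
"""

def members_of_disjoint_classes(graph: Graph):
    # Each row names an instance ?x typed with two disjoint classes.
    return list(graph.query(DISJOINT_MEMBERS))
```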
10. Proposed Metrics, with the related quality deficiency in parentheses
• Miss_Vlu (M1): the ratio of properties defined in the schema but not present in the dataset (errors in property values)
• Out_Vlu (M2): the ratio of triples containing properties with out-of-range values (errors in property values)
• Msspl_Prp_Vlu (M3): the ratio of properties containing misspelled values (errors in property values)
• Und_Cls_Prp (M4): the ratio of triples using classes or properties without any formal definition (improper usage of classes/properties)
• Dsj_Cls (M5): the ratio of instances that are members of disjoint classes (improper usage of classes/properties)
• Inc_Prp_Vlu (M6): the ratio of triples in which the property values are inconsistent (errors in property values)
• FP (M7): the ratio of triples with functional properties that contain inconsistent values (errors in property values)
• IFP (M8): the ratio of triples containing invalid usage of inverse-functional properties (invalid usage of inverse-functional properties)
• Im_DT (M9): the ratio of triples containing data type properties with inappropriate data types (not using appropriate data types for the literals)
• Sml_Cls (M10): the ratio of classes with different names but the same instances (improper definition of classes)
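All of the proposed metrics are normalized ratios over the dataset. Writing |D| for the number of triples in a dataset D, M4 (Und_Cls_Prp), for example, can be expressed as follows (the notation is ours, reconstructed from the description above):

\[
M_4(D) = \frac{\bigl|\{\, t \in D : t \text{ uses a class or property with no formal definition} \,\}\bigr|}{|D|}
\]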
11. Empirical Evaluation
• Selecting several real datasets from the LOD
• Calculating the metric values for the datasets
• Studying metric interdependency
• Manipulating the quality of the datasets
• Comparing the trends of the metrics over the two observations
12. Empirical Evaluation: Real-world datasets

Dataset                            Triples  Instances  Classes  Properties
FAO Water Areas                     10,730        586       31          19
Water Economic Zones                29,193      1,074      113         127
Large Marine Ecosystems             12,012        716       21          31
Geopolitical Entities               22,725        312       88         101
ISSCAAP Species Classification     398,166     25,253       52          93
Species Taxonomic Classification   319,490     11,741       33          26
Commodities                         56,420      2,788       10          19
Vessels                              4,236        240        6          22
13. Empirical Evaluation: Calculating the metric values for the datasets
14. Empirical Evaluation: Metric interdependency study (steps 1-3 completed)
Result:
• Two pairs of metrics are correlated: {IFP, Im_DT} and {IFP, Inc_Prp_Vlu}
• The other metrics are independent
15. Empirical Evaluation: Manipulating the quality of the datasets using heuristics (steps 1-3 completed)
16. Empirical Evaluation: Comparing the trends of the metrics over the two observations (all five steps completed)
17. Results analysis
In 78% of the scenarios, the metrics reacted as expected to the corresponding heuristics:
• Heuristics were applied and the corresponding metrics changed (23%)
• Heuristics were not applied and the metric values did not change (55%)
In 22% of the scenarios, heuristics were applied but the corresponding metrics did not change, because:
• Dependencies between heuristics caused side effects
• The ratio of applied heuristics to the size of the datasets was very low
The main goal of LOD is to create knowledge by interlinking already existing but dispersed data.
LOD is built on three principles: URIs, RDF, and HTTP.
To better understand the problem, we review the requirements for publishing a dataset.
First, its resources should be resolvable via HTTP URIs.
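A minimal sketch of such a resolvability check, using Python's requests library; the accepted RDF content types below are our assumption about what counts as a successful dereference:

```python
# Sketch: testing whether a resource URI dereferences to an RDF
# representation. The accepted content types are an assumption.
import requests

def is_dereferenceable(uri: str) -> bool:
    try:
        resp = requests.get(
            uri,
            headers={"Accept": "application/rdf+xml, text/turtle"},
            timeout=10,
            allow_redirects=True,
        )
    except requests.RequestException:
        return False
    content_type = resp.headers.get("Content-Type", "")
    return resp.ok and any(
        ct in content_type
        for ct in ("rdf+xml", "turtle", "n-triples", "ld+json")
    )
```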
However, the usefulness of the knowledge created through LOD strongly depends on the quality of the published data.
Now, a question: if a dataset satisfies all of these criteria, is it ready to be published?
This is the question that we are going to answer now.
Since the main idea of Linked Data is based on "publish first, improve later", quality problems arise in the published datasets.
On the other hand, the usefulness of the knowledge created from LOD strongly depends on the quality of its data, so there is a missing link in the publishing process: data quality evaluation before release.
So, we propose an automated and scalable metric-based method to assess the quality of datasets before they are published and interlinked into the LOD.
We have found that many of the published datasets suffer from quality issues. We believe that most of these issues are rooted in deficiencies of the sources from which the data is extracted, and that they can be avoided if they are identified in the initial stages of publishing.
We present the quality deficiencies of data within published datasets, focusing on those that relate to the dataset itself and are detectable in the initial phase of publication. To this end, we have classified the quality issues into schema level and instance level, as presented in the tables above.
In summary, we have identified eleven quality deficiencies characterizing eighteen quality issues at both the schema and instance levels. Among these, six quality issues, whose resolution requires a domain expert or an ontologist, cannot be detected by any kind of automated method and need the intervention of human experts. These issues are highly subjective, and it is hard, if not impossible, to assess them automatically.
Among the remaining quality issues, only three can be detected and resolved by validators. To the best of our knowledge, no validator covers all of the remaining issues, particularly those relating to schema incompatibility, naming, and inconsistent data. We therefore propose a set of metrics to address the remaining quality issues. We note that the identified issues and the proposed metrics are not meant to be comprehensive and are limited to the issues reported in the literature.
In this section, a set of metrics is proposed to address the aforementioned quality issues that can be resolved in an automated or semi-automated way.
To achieve this, metrics proposed in the areas of Linked Data, relational databases, and data quality models were considered, and the results were used as guidelines for designing a set of metrics suited to our purpose. The proposed metrics are presented in the table on slide 10.
As the definitions show, all of the metrics measure quality problems within the scope of the RDF dataset itself, not in the context of other datasets. In terms of the level of the quality deficiency, the last two metrics (Im_DT and Sml_Cls) address intrinsic quality issues at the schema level, while the others relate to intrinsic quality problems at the instance level.
All of the proposed metrics are theoretically validated in our recently published work [14].
It is necessary to evaluate the metrics empirically to observe their behavior and show their applicability in practice. So, we first calculated the values of the metrics for eight LOD datasets of different domains and sizes.
Next, we manipulated the quality of these datasets by applying some heuristics and then recalculated the metric values to observe the behavior of the metrics over these changes. The aim of this dataset manipulation was to investigate the trends of the metrics over real datasets and to compare the results of applying the metrics to good- and poor-quality data.
We have selected eight datasets from the EU FP6 Networked Ontology (NeOn) project.
We implemented a tool that automatically computes the metric values for any given input dataset.
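The tool itself is not included in the slides; purely as an illustration, one of the automatable metrics, Und_Cls_Prp (M4), could be computed with rdflib along the following lines (all names are ours, not taken from the authors' tool):

```python
# Sketch: Und_Cls_Prp (M4), the ratio of triples that use a class or
# property with no formal definition in the schema.
from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

def und_cls_prp(data_path: str, schema_path: str) -> float:
    data, schema = Graph(), Graph()
    data.parse(data_path)
    schema.parse(schema_path)

    # Terms the schema formally defines as classes or properties.
    defined = set()
    for term_type in (RDFS.Class, OWL.Class, RDF.Property,
                      OWL.ObjectProperty, OWL.DatatypeProperty):
        defined.update(schema.subjects(RDF.type, term_type))

    flagged = 0
    for s, p, o in data:
        if p == RDF.type:
            # An rdf:type triple is flagged when the class is undefined.
            if o not in defined:
                flagged += 1
        elif p not in defined:
            flagged += 1

    return flagged / len(data) if len(data) else 0.0
```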
Then we used Spearman's test to study metric interdependency; a correlation with a p-value ≤ 0.05 is considered significant. Two pairs of metrics are correlated, and the others are independent.
The reason for the observed correlation is that the values of the correlated metrics are mostly zero, which indicates that the datasets contain few deficiencies related to these metrics.
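As a sketch of the interdependency check, SciPy's spearmanr implements Spearman's test; the metric values below are placeholders for illustration only, not the study's actual measurements:

```python
# Sketch: pairwise Spearman correlation between metrics, one value per
# dataset. The numbers are fabricated placeholders, not the paper's data.
from itertools import combinations
from scipy.stats import spearmanr

metric_values = {
    "IFP":         [0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.02, 0.00],
    "Im_DT":       [0.00, 0.00, 0.01, 0.00, 0.00, 0.00, 0.03, 0.00],
    "Inc_Prp_Vlu": [0.00, 0.00, 0.02, 0.00, 0.00, 0.00, 0.02, 0.00],
}

for (name_a, a), (name_b, b) in combinations(metric_values.items(), 2):
    rho, p = spearmanr(a, b)
    if p <= 0.05:  # the significance threshold used in the paper
        print(f"{name_a} ~ {name_b}: rho={rho:.2f}, p={p:.3f}")
```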
To contaminate the datasets, fourteen heuristics were introduced. Some quality issues, such as misspelling errors, were introduced using an ontology editor, i.e. Protégé.
For quality issues such as invalid usage of inverse-functional properties, errors were introduced manually. We applied the heuristics to the different datasets at random; the rationale was to measure the values of our metrics both before and after the quality issues were injected.
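For intuition, a contamination heuristic such as misspelling injection could be automated roughly as follows (our sketch; the authors used Protégé and manual edits):

```python
# Sketch: corrupting a fraction of literal values by swapping two
# adjacent characters, to emulate the misspelling heuristic.
import random
from rdflib import Graph, Literal

def inject_misspellings(graph: Graph, fraction: float = 0.05,
                        seed: int = 42) -> Graph:
    rng = random.Random(seed)
    candidates = [(s, p, o) for s, p, o in graph
                  if isinstance(o, Literal) and len(str(o)) > 3]
    for s, p, o in rng.sample(candidates, int(len(candidates) * fraction)):
        text = str(o)
        i = rng.randrange(len(text) - 1)
        corrupted = text[:i] + text[i + 1] + text[i] + text[i + 2:]
        graph.remove((s, p, o))
        graph.add((s, p, Literal(corrupted, datatype=o.datatype,
                                 lang=o.language)))
    return graph
```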
The aim of the second experiment is to show the trends of the proposed metrics by recalculating their values over the manipulated datasets. We would ideally expect to observe meaningful changes in the metric values according to the heuristics used to manipulate the datasets. In light of the results, most of the outcomes are desirable; however, some of the values need further discussion.
Looking at the results, we observed that in 78% of the scenarios the metrics behaved as expected.
Dependencies between heuristics led to side effects: some of the heuristics were not independent, so the order in which they were applied affected the quality problems measured by the metrics.
The ratio of applied heuristics to the size of the datasets was very low: based on this experiment, whenever the number of changes is below 10% of the triples, the changes in the metric values cannot be properly observed. For example, injecting 100 erroneous triples into the 398,166-triple ISSCAAP dataset shifts a ratio metric by only about 0.025 percentage points.
Although no radical shift in the metric values was observed in our scenario, we do not generalize our findings about the trends of the metrics, given the limited number of datasets used in this experiment. More experiments are needed to reach a valid conclusion about how the metrics react to changes in the measurement subject.