RDF Data Quality Assessment - connecting the pieces

{RDF} Data Quality Assessment
Connecting the pieces...
Dimitris Kontokostas
Senior Knowledge Engineer
Connected Data London 2018 - Nov 7th 2018

About me...
● Data geek, software engineer & open source enthusiast
● PhD in knowledge extraction and quality assessment
● Involved in graph-related standardization activities (ShEx/SHACL)
● Author of the RDFUnit Java library
● Co-author of “Validating RDF Data” book
● Working on the GeoPhy Real Estate Knowledge Graph

Overview
● Attempt to define data quality
● Identify data quality issues
● Means for tackling them

Quality of life...

Data Quality is multidimensional

Which one is better?
ex:Foo
    a dbo:Person ;
    dbo:birthDate "2000-01-01"^^xsd:date .
ex:Bar
    a foaf:Person ;
    foaf:age 18 .
ex:Baz
    wkd:p31 wk:Q5 ;
    wkd:p569 "2000-01-01"^^xsd:date .

Would you use this information for …
ex:Chickenpox
    a ex:InfectiousDisease ;
    ex:symptoms "rash", "fever", "headache" ;
    ex:treatWithVaccine ex:VaricellaVaccine .
ex:VaricellaVaccine
    a ex:Vaccine ;
    ex:treats ex:Chickenpox, ex:HerpesZoster .
- a visualization?
- a disease website?
- automated treatment?

Data Quality is fitness for use

Data Quality Dimension themes
Accessibility: accessing & retrieving data, complete or part of
Contextual: depend on the use-case context or consumer preference
Intrinsic: independent of context
Representational: related to data design
See A. Zaveri et al. Quality Assessment for Linked Data: A Survey

Accessibility Dimensions
Availability: can you access the data?
Licence: can you use the data?
Performance: can you get the data in reasonable time?

Contextual Dimensions
Relevancy: does it cover your needs?
Trustworthiness: do you trust the publisher?
Understandability: do you understand the data? Is there documentation?
Timeliness: is the data stale or up to date?

Intrinsic Dimensions
Syntactically valid: are there any syntax errors?
Semantically accurate: are there outliers, misused labels?
Consistent: are there inconsistencies?
Concise: are there duplicates and/or ambiguity, NULLs?
Complete: are records or values missing?
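Two of these dimensions are easy to turn into numbers. The sketch below is only an illustration, assuming records are plain Python dicts; the sample records and the helper names `completeness` and `duplicate_ids` are hypothetical and not part of the talk.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": "ex:Foo", "name": "Foo", "birthDate": "2000-01-01"},
    {"id": "ex:Bar", "name": "Bar", "birthDate": None},
    {"id": "ex:Foo", "name": "Foo", "birthDate": "2000-01-01"},  # duplicate
]

def completeness(rows, field):
    """Fraction of rows that carry a non-empty value for `field`."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

def duplicate_ids(rows):
    """Identifiers that occur more than once (a conciseness signal)."""
    counts = Counter(r["id"] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)

birth_date_completeness = completeness(records, "birthDate")
repeated = duplicate_ids(records)
```

Whether a given completeness ratio is good enough depends on the use case, which is the point of the contextual dimensions.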

Representational Dimensions
Interoperable: are terms/labels/vocabularies reused?
Interpretable: is it self-descriptive?
Versatile: is it provided in multiple formats / languages?

How good do you need it to get?
There are great costs in:
> assessing the quality of a dataset
> improving the quality of a dataset
Cost is highly dependent on whether:
> data & assets are outside of your control
> data & assets are within your control
> data & assets are bought
Delivering more or less quality than you need impacts costs and/or the product
[Figure: quality vs. cost ($)]

Where data can go wrong
Source data
Master schema
Mappings
Validation Rules
Identity Resolution
Data Fusion

Source data
can be (semi) unstructured
can be messy
may not fit into a/the schema

Master schema
Incorrect modeling
Incomplete modeling
Inaccurate translation
> to owl, rdfs, ShEx, SHACL, etc
Undesired expressivity
> RDFS, OWL: DL/RL/FULL, etc

Mappings
Incorrect mapping
> errors scale with the source size (up to millions)
Incomplete mapping
Software bugs
> conversion scripts, ETL code, etc.
Model sync
> port schema updates

Validation Rules
Incorrect translation
> birthDate max cardinality 1
> birthDate min cardinality 1
Syntax errors & typos
> dirthDate must be xsd:date
Model sync
> port schema updates
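A small sketch of how such rule errors go unnoticed, assuming triples are plain (subject, property, value) tuples. The `check_cardinality` helper and the `dbo:` prefix on the property names are assumptions made for illustration, not the API of any validation engine.

```python
from collections import defaultdict

def count_values(triples, prop):
    """Map each subject to the number of values it has for `prop`."""
    counts = defaultdict(int)
    for s, p, _ in triples:
        counts[s]  # register every subject, even those without `prop`
        if p == prop:
            counts[s] += 1
    return dict(counts)

def check_cardinality(triples, prop, min_count=0, max_count=None):
    """Return the subjects that violate the min/max cardinality of `prop`."""
    bad = []
    for subject, n in count_values(triples, prop).items():
        if n < min_count or (max_count is not None and n > max_count):
            bad.append(subject)
    return sorted(bad)

data = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
    ("ex:Bar", "rdf:type", "dbo:Person"),  # no birth date at all
]

# Intended rule: exactly one birth date per subject.
exact = check_cardinality(data, "dbo:birthDate", min_count=1, max_count=1)
# Mistranslated rule: only the max cardinality was carried over.
max_only = check_cardinality(data, "dbo:birthDate", max_count=1)
# Typo in the rule: the misspelled property matches nothing.
typo = check_cardinality(data, "dbo:dirthDate", max_count=1)
```

Only the intended rule flags ex:Bar. The mistranslated rule and the rule with the typo both report no violations, so the validation looks green while the data is incomplete.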

Evolution & quality
See http://aligned-project.eu

Sounds good so far… now what?

Strategies for managing quality
Data testers
> explicit / implicit roles
Crowdsourcing
> field experts vs MTurk
Executable validation rules
> SHACL, ShEx, OWL
See Acosta et al. Detecting Linked Data quality issues via crowdsourcing: A DBpedia study
Kontokostas et al. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data / Demo
> Needs good tool support
> Generic tools missing
> Validation engines improved

Validate closer to the source of the error
See Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality
Kontokostas et al. Semantically Enhanced Quality Assurance in the JURION Business Use Case
> Always in the K range
> Scales with source size
> Errors scale as well

Automate, automate & automate...
Schema.ttl:
ex:name
    a rdf:Property ;
    rdfs:range rdf:langString .
Data.ttl:
ex:Foo
    a dbo:Person ;
    ex:name "Foo @en" .

Automate, automate & automate...
Schema.ttl:
ex:name
    a rdf:Property ;
    rdfs:range rdf:langString .
Data.ttl:
ex:Foo
    a dbo:Person ;
    # before: ex:name "Foo @en" .  (plain string, violates the rdf:langString range)
    ex:name "Foo"@en .
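The check implied by the schema and data above can be sketched in a few lines. This is a simplified illustration, not how RDFUnit or a SHACL engine works internally: literals are modelled as (lexical form, language tag) pairs, and `range_violations` is a hypothetical helper.

```python
# Property ranges as declared in the schema.
schema_ranges = {"ex:name": "rdf:langString"}

# Literals are (lexical_form, language_tag) pairs; the tag is None for
# plain strings. This representation is an assumption of the sketch.
data = [
    ("ex:Foo", "ex:name", ("Foo @en", None)),  # tag typed inside the quotes
    ("ex:Foo", "ex:name", ("Foo", "en")),      # proper language-tagged literal
]

def range_violations(triples, ranges):
    """Return triples whose literal has no language tag although the
    property's declared range is rdf:langString."""
    errors = []
    for s, p, (lexical, lang) in triples:
        if ranges.get(p) == "rdf:langString" and not lang:
            errors.append((s, p, lexical))
    return errors

violations = range_violations(data, schema_ranges)
```

Because the rule is derived from the schema rather than written by hand, it stays in sync when the schema changes.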

CI/CD is your best friend
Treat data as code
> Jenkins, Travis, GitLab, TeamCity, ...
Trigger validation on every (single) change
> Fail the build until data issues are fixed
Create (data) integration tests
Just like in software…
> Green CI does not mean no errors/bugs
> Green CI may just mean not enough tests
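A minimal sketch of a validation step that a CI job could run on every change. The single rule (every dbo:Person needs a dbo:birthDate), the tuple-based data, and the function names are assumptions for illustration. A real pipeline would load the dataset and run a validation engine instead.

```python
import sys

def validate(triples):
    """Hypothetical rule: every dbo:Person needs a dbo:birthDate."""
    persons = {s for s, p, o in triples if p == "rdf:type" and o == "dbo:Person"}
    with_date = {s for s, p, _ in triples if p == "dbo:birthDate"}
    return [f"{s}: missing dbo:birthDate" for s in sorted(persons - with_date)]

def main(triples):
    """Print violations to stderr and return an exit code for the CI job."""
    errors = validate(triples)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0

sample = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
]
exit_code = main(sample)
```

In a CI script the last step would be `sys.exit(exit_code)`, since a non-zero exit code is what fails the build in Jenkins, Travis, GitLab and similar tools.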

Recap
> Data quality is fitness for use
> Can be assessed with multiple dimensions
> Identify the quality you need
> Also look for errors in the schema, the rules and the mappings
> Validate closer to the error source
> Automate as much as possible

Thank you! Questions?
@jimkont
kontokostas.com
slideshare.net/jimkont
