RDF Data Quality Assessment - connecting the pieces

RDF and graph databases are steadily gaining adoption and are no longer the choice of niche communities only. For almost 20 years, the lack of a constraint language for RDF was a major gap in the technology stack and an obstacle to further adoption.

Even though most RDF-based systems were performing data validation and quality assessment, there was no standardized way to define constraints. People were using ad-hoc solutions or schemas and languages that were not meant for validation.

Thankfully, since 2017 there have been two additions to the RDF technology stack: SHACL and ShEx. Both provide a high-level RDF constraint language for defining data constraints (a.k.a. shapes), each with different strengths.

This talk outlines the different types of RDF data quality issues and the existing approaches to quality assessment. The goal is to give an overview of the existing RDF validation landscape and, hopefully, to inspire people to improve their RDF publishing workflows.


  1. 1. {RDF} Data Quality Assessment Connecting the pieces... Dimitris Kontokostas Senior Knowledge Engineer Connected Data London 2018 - Nov 7th 2018
  2. 2. About me... ● Data geek, software engineer & open source enthusiast ● PhD in knowledge extraction and quality assessment ● Involved in graph-related standardization activities (ShEx/SHACL) ● Author of the RDFUnit Java library ● Co-author of “Validating RDF Data” book ● Working on the GeoPhy Real Estate Knowledge Graph
  3. 3. Overview ● Attempt to define data quality ● Identify data quality issues ● Means for tackling them
  4. 4. What is data quality?
  5. 5. What is ??? quality?
  6. 6. Quality of life... Image credits
  7. 7. Quality of OS
  8. 8. Data Quality is multidimensional (image credits)
  9. 9. Which one is better? ex:Foo a dbo:Person ; dbo:birthDate "2000-01-01"^^xsd:date . ex:Bar a foaf:Person ; foaf:age 18 . ex:Baz wkd:p31 wk:Q5 ; wkd:p569 "2000-01-01"^^xsd:date .
  10. 10. Would you use this information for … ex:Chickenpox a ex:InfectiousDisease ; ex:symptoms "rash", "fever", "headache" ; ex:treatWithVaccine ex:VaricellaVaccine . ex:VaricellaVaccine a ex:Vaccine ; ex:treats ex:Chickenpox, ex:HerpesZoster . - a visualization? - a disease website? - automated treatment?
  11. 11. Data Quality is fitness for use
  12. 12. Data Quality Dimension themes Accessibility: accessing & retrieving data, in whole or in part Contextual: depends on the use-case context or consumer preference Intrinsic: independent of context Representational: related to data design See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  13. 13. Accessibility Dimensions Availability: can you access the data? Licence: can you use the data? Performance: can you get the data in reasonable time? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  14. 14. Contextual Dimensions Relevancy: does it cover your needs? Trustworthiness: do you trust the publisher? Understandability: do you understand the data? Is there documentation? Timeliness: is the data stale or up to date? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  15. 15. Intrinsic Dimensions Syntactically valid: are there any syntax errors? Semantically accurate: are there outliers, misused labels? Consistent: are there inconsistencies? Concise: are there duplicates and/or ambiguity, NULLs? Complete: are records or values missing? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  16. 16. Representational Dimensions Interoperable: are terms/labels/vocabularies reused? Interpretable: is it self-descriptive? Versatile: is it provided in multiple formats / languages? See A. Zaveri et al., Quality Assessment for Linked Data: A Survey
  17. 17. How good do you need it to get? There are great costs in: > assessing the quality of a dataset > improving the quality of a dataset Cost is highly dependent on whether: > data & assets are outside of your control > data & assets are within your control > data & assets are bought More or less quality than what you need impacts costs and/or product Quality Cost ($)
  18. 18. Where things can go wrong
  19. 19. Where data can go wrong
  20. 20. Where data can go wrong Source data Master schema Mappings Validation Rules Identity Resolution Data Fusion
  21. 21. Source data can be (semi) unstructured can be messy cannot fit into a/the schema
  22. 22. Master schema Incorrect modeling Incomplete modeling Inaccurate translation > to OWL, RDFS, ShEx, SHACL, etc Undesired expressivity > RDFS, OWL: DL/RL/Full, etc
  23. 23. Mappings Incorrect mapping errors scale to the source size (up to millions) Incomplete mapping Software bugs conversion scripts, ETL code, etc Model sync > port schema updates
  24. 24. Validation Rules Incorrect translation > birthDate max cardinality 1 > birthDate min cardinality 1 Syntax error & typos > dirthDate must be xsd:date Model sync > port schema updates
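The difference between the two birthDate rules on this slide can be sketched as executable checks. This is a minimal illustration in plain Python rather than SHACL or ShEx; the `check_cardinality` helper and the sample triples are assumptions made for the example and are not taken from the talk.

```python
from collections import Counter

# Triples as plain (subject, predicate, object) tuples; sample data is hypothetical.
triples = [
    ("ex:Foo", "rdf:type", "dbo:Person"),
    ("ex:Foo", "dbo:birthDate", "2000-01-01"),
    ("ex:Bar", "rdf:type", "dbo:Person"),
    ("ex:Baz", "rdf:type", "dbo:Person"),
    ("ex:Baz", "dbo:birthDate", "2000-01-01"),
    ("ex:Baz", "dbo:birthDate", "2001-02-03"),
]

def check_cardinality(triples, target_class, predicate, min_count=0, max_count=None):
    """Return (subject, count) pairs whose value count violates the bounds."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == target_class}
    counts = Counter(s for s, p, o in triples if p == predicate and s in instances)
    violations = []
    for subject in sorted(instances):
        n = counts[subject]
        if n < min_count or (max_count is not None and n > max_count):
            violations.append((subject, n))
    return violations

# "At most one birthDate" and "at least one birthDate" are different rules
# and catch different resources.
max_violations = check_cardinality(triples, "dbo:Person", "dbo:birthDate", max_count=1)
min_violations = check_cardinality(triples, "dbo:Person", "dbo:birthDate", min_count=1)
print(max_violations)
print(min_violations)
```

Translating "max cardinality 1" as "min cardinality 1" silently swaps which resources are reported, which is why the rules themselves need review.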
  25. 25. Evolution & quality ↻ ↻ ↻ ↻ ↻ ↻ See http://aligned-project.eu
  26. 26. Sounds good so far… now what?
  27. 27. Strategies for managing quality Data testers > explicit / implicit roles Crowdsourcing > field experts vs MTurk Executable validation rules > SHACL, ShEx, OWL See Acosta et al. Detecting Linked Data quality issues via crowdsourcing: A DBpedia study Kontokostas et al. TripleCheckMate: A Tool for Crowdsourcing the Quality Assessment of Linked Data / Demo > Needs good tool support > Generic tools missing > Validation engines improved
  28. 28. Validate closer to the source of the error ↻ ↻ ↻ ↻ ↻ See Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality Kontokostas et al. Semantically Enhanced Quality Assurance in the JURION Business Use Case > Always in the K range > Scales with source size > Errors scale as well
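A rough sketch of why validating the mappings pays off: one wrong mapping rule yields one error at the mapping level, but one error per source row once the data is generated. The rule format, the `validate_mappings` helper and the row count below are hypothetical and only illustrate the idea; they are not the approach of the cited papers.

```python
# Hypothetical schema: expected datatype per predicate.
schema_ranges = {
    "dbo:birthDate": "xsd:date",
    "foaf:age": "xsd:integer",
}

# Hypothetical mapping rules: one rule per source column.
mapping_rules = [
    {"column": "birth_date", "predicate": "dbo:birthDate", "datatype": "xsd:string"},
    {"column": "age", "predicate": "foaf:age", "datatype": "xsd:integer"},
]

def validate_mappings(rules, ranges):
    """Return one message per rule whose datatype disagrees with the schema."""
    errors = []
    for rule in rules:
        expected = ranges.get(rule["predicate"])
        if expected is not None and rule["datatype"] != expected:
            errors.append(
                f"{rule['column']} -> {rule['predicate']}: "
                f"maps to {rule['datatype']}, schema expects {expected}"
            )
    return errors

mapping_errors = validate_mappings(mapping_rules, schema_ranges)

# The same bug seen from the generated data: one violation per source row.
source_rows = 1000  # assumed size of the source
data_level_errors = len(mapping_errors) * source_rows

print(mapping_errors)
print(data_level_errors)
```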
  29. 29. Automate, automate & automate... Schema.ttl: ex:name a rdf:Property ; rdfs:range rdf:langString . Data.ttl: ex:Foo a dbo:Person ; ex:name "Foo @en" .
  30. 30. Automate, automate & automate... Schema.ttl: ex:name a rdf:Property ; rdfs:range rdf:langString . Data.ttl: ex:Foo a dbo:Person ; ex:name "Foo @en" . ex:name "Foo"@en .
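The check implied by these two slides can be derived from the schema automatically: if a property has range rdf:langString, every value must carry a language tag. The sketch below models literals as (lexical form, language tag) pairs in plain Python; that representation and the `range_violations` helper are assumptions for illustration, not how any particular validation engine works.

```python
# Literals are modeled as (lexical_form, language_tag) pairs; None means no tag.
schema = {"ex:name": "rdf:langString"}  # rdfs:range per property

data = [
    ("ex:Foo", "ex:name", ("Foo @en", None)),  # tag typed inside the string
    ("ex:Foo", "ex:name", ("Foo", "en")),      # proper language-tagged literal
]

def range_violations(data, schema):
    """Flag values of rdf:langString properties that carry no language tag."""
    violations = []
    for subject, predicate, (lexical, lang) in data:
        if schema.get(predicate) == "rdf:langString" and lang is None:
            violations.append((subject, predicate, lexical))
    return violations

violations = range_violations(data, schema)
print(violations)
```

Only the literal with the tag typed inside the quotes is reported; the correctly tagged value passes.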
  31. 31. CI/CD is your best friend Treat data as code > Jenkins, Travis, GitLab, TeamCity, ... Trigger validation on every (single) change > Fail the build until data issues are fixed Create (data) integration tests Just like in software… > Green CI <> No Errors/bugs > Green CI => Not enough tests
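"Fail the build until data issues are fixed" usually comes down to a script that returns a non-zero exit code when any check reports an error. A minimal sketch, assuming two made-up checks and sample rows; the names `check_names`, `check_birth_dates` and `run_checks` are hypothetical:

```python
from datetime import date

# Hypothetical records produced by a data pipeline.
rows = [
    {"name": "Foo", "birthDate": "2000-01-01"},
    {"name": "", "birthDate": "01/01/2000"},
]

def check_names(rows):
    """Report rows with an empty name."""
    return [f"row {i}: empty name" for i, row in enumerate(rows) if not row["name"]]

def check_birth_dates(rows):
    """Report rows whose birthDate is not in ISO YYYY-MM-DD form."""
    errors = []
    for i, row in enumerate(rows):
        try:
            date.fromisoformat(row["birthDate"])
        except ValueError:
            errors.append(f"row {i}: birthDate is not an ISO date")
    return errors

def run_checks(rows, checks):
    """Run every check; return (exit_code, errors), with exit code 1 on any error."""
    errors = [e for check in checks for e in check(rows)]
    return (1 if errors else 0), errors

exit_code, errors = run_checks(rows, [check_names, check_birth_dates])
for error in errors:
    print(error)
print("exit code:", exit_code)
# In a CI job the script would end with sys.exit(exit_code) so the build fails.
```

As the slide notes, a green run only shows that the existing checks passed, so the list of checks needs to grow with the data.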
  32. 32. Recap > Data quality is fitness for use > Can be assessed with multiple dimensions > Identify the quality you need > Also look for errors in the schema, the rules and the mappings > Validate closer to the error source > Automate as much as possible
  33. 33. Thank you! Questions? @jimkont kontokostas.com slideshare.net/jimkont
