
Data Quality and the FAIR principles


Presentation at the Conquaire workshop on Data Quality and Reproducibility:

Published in: Education


  1. 1. Data Quality and the FAIR principles. Amrapali Zaveri (@amrapaliz), Conquaire Workshop on Data Quality and Reproducibility, April 3, 2019
  3. 3. An increasing number of discoveries are made using other people’s data
  4. 4. Biggest Challenge: Poor Data Quality (garbage in, garbage out)
  5. 5. Consequences of Poor Data Quality - Inaccuracy. IBM Watson delivered ‘unsafe and inaccurate’ cancer recommendations (inaccurate-cancer-recommendations/). In one example, a patient was recommended a drug that could lead to severe or fatal hemorrhage while he was already dealing with severe bleeding due to his condition. In another example, a Florida doctor who reviewed the system told the company that the technology is "a piece of shit."
  6. 6. Quality issues in Integrating Data
  7. 7. Data Quality Assessment Dimensions & Metrics
  8. 8. Definitions • Data Quality: commonly conceived as a multidimensional construct, with the popular definition ‘fitness for use’*. • Dimension: a characteristic of a dataset. • Metric (or indicator): a procedure for measuring an information quality dimension. *Juran et al., The Quality Control Handbook, 1974
  9. 9. Data Quality Assessment Goals: Fix data quality issues in given sets of (semantic) data. Such quality issues may • be in source datasets (e.g., inaccurate or wrong data items, outdated data items) • result from imperfections of a data integration process (e.g., data items that have been incorrectly linked with each other) • reveal themselves only after the data integration (e.g., duplicates, inconsistencies). Data cleaning may be relevant both for original datasets before combining/integrating and for datasets resulting from an integration. Source:
  10. 10. Systematic Literature Review: 30 core articles (Conference - 21, Journal - 8, Master's Thesis - 1); 18 Dimensions; 69 Metrics
  11. 11. Data Quality Dimensions. A Zaveri, A Rula, A Maurino, R Pietrobon, J Lehmann, S Auer. Quality assessment for linked data: A survey. Semantic Web 7 (1), 63-93. 300+ citations
  12. 12. Open Quality Dimensions - Accessibility. Availability - extent to which data (or some portion of it) is present, obtainable and ready for use. Metrics: • accessibility of the SPARQL endpoint and the server • dereferenceability of the URI
 Questions to ask: Is there access information for resources provided? Is information available that can help to discover/search datasets? Is there information about format, size or update frequency of the resources? Can the described resources be retrieved by an agent? Also relates to Metadata Quality!
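
The dereferenceability metric on this slide could be approximated as the share of URIs that answer an HTTP GET with a 2xx status. The sketch below is one possible reading, not the method used in the survey; the function names, the `Accept` header and the injectable `opener` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_dereferenceable(uri, opener=urllib.request.urlopen, timeout=10):
    """Return True if an HTTP GET on `uri` ends in a 2xx response.

    `opener` is an illustrative hook so the check can be exercised
    without network access; it defaults to urllib's urlopen, which
    follows redirects before reporting the final status.
    """
    try:
        request = urllib.request.Request(
            uri, headers={"Accept": "text/turtle, application/rdf+xml"}
        )
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Covers HTTP errors, unreachable hosts, timeouts and malformed URIs.
        return False


def dereferenceability(uris, opener=urllib.request.urlopen):
    """Fraction of the given URIs that can be dereferenced (0.0 if none given)."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(is_dereferenceable(u, opener) for u in uris) / len(uris)
```

A score near 1.0 would suggest that most resources can be retrieved by an agent; what threshold counts as acceptable depends on the intended use.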
  13. 13. Open Quality Dimensions - Interlinking Interlinking - degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources Metrics: • detection of the existence and usage of external URIs (target dataset) • detection of all local in-links or back-links: all triples from a dataset that have the resource’s URI as the object
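
Both interlinking metrics can be sketched over a plain list of triples. This is a minimal illustration under stated assumptions: triples are `(subject, predicate, object)` tuples of strings, anything that parses with a host part is treated as a URI, and the `example.org` data is invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical example data: two links from :alice and one literal.
EXAMPLE_TRIPLES = [
    ("http://example.org/alice", "http://schema.org/knows",
     "http://example.org/bob"),
    ("http://example.org/alice", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Alice"),
    ("http://example.org/bob", "http://schema.org/name", "Robert"),
]


def external_links(triples, local_host):
    """Triples whose object is a URI hosted outside `local_host`."""
    found = []
    for subject, predicate, obj in triples:
        host = urlparse(obj).netloc  # empty for plain literals such as "Robert"
        if host and host != local_host:
            found.append((subject, predicate, obj))
    return found


def in_links(triples, resource_uri):
    """All triples that have `resource_uri` as their object."""
    return [t for t in triples if t[2] == resource_uri]
```

With the example data, `external_links(EXAMPLE_TRIPLES, "example.org")` returns only the `owl:sameAs` triple, and `in_links(EXAMPLE_TRIPLES, "http://example.org/bob")` returns only the `schema:knows` triple.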
  14. 14. Open Quality Dimensions - Contextual. Trustworthiness - defined as the degree to which the information is accepted to be correct, true, real and credible. Metrics: • annotating triples with provenance data and using the provenance history to evaluate the trustworthiness of facts - FAIR provenance and metadata, provenance ontologies (PROV-O, HCLS) • opinion-based methods, which use trust annotations made by several individuals • specification of the license of the dataset or resource
  15. 15. Open Data Quality Dimensions - Open Data. Is the specified format and license information suitable to classify a dataset as open? Metrics: • Is the file format based on an open standard? • Can the file format be considered machine readable? • Does the license used conform to the open definition? Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres, 2015. Automated Quality Assessment of Metadata across Open Data Portals. ACM J. Data Inform. DOI: http://
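
The three openness questions can be expressed as lookups against reference lists. The sketch below is not the procedure from Neumaier et al.; the metadata keys (`format`, `license`) and the short format and license sets are illustrative assumptions, and a real assessment would need curated, much longer lists.

```python
# Illustrative reference lists (assumptions, far from complete).
OPEN_FORMATS = {"csv", "json", "xml", "rdf", "ttl"}
MACHINE_READABLE_FORMATS = OPEN_FORMATS | {"xls"}  # readable but proprietary
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "odc-pddl-1.0"}


def assess_openness(metadata):
    """Answer the three slide questions for one dataset description.

    `metadata` is a dict with hypothetical keys "format" and "license".
    """
    file_format = metadata.get("format", "").strip().lower()
    license_id = metadata.get("license", "").strip().lower()
    checks = {
        "open_format": file_format in OPEN_FORMATS,
        "machine_readable": file_format in MACHINE_READABLE_FORMATS,
        "open_license": license_id in OPEN_LICENSES,
    }
    # A dataset counts as open only if every individual check passes.
    checks["is_open"] = all(checks.values())
    return checks
```

For example, `assess_openness({"format": "XLS", "license": "CC-BY-4.0"})` reports the file as machine readable and openly licensed, but not open overall, because the format is not in the open-standard list.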
  16. 16. Data Quality Assessment Tools
  17. 17. RDF Unit Testing Suite: Syntactic Validity, Semantic Accuracy, Consistency
  18. 18. Crowdsourcing Quality Assessment. Acosta, M.; Zaveri, A.; Simperl, E.; Kontokostas, D.; Flöck, F. & Lehmann, J. (2016), 'Detecting Linked Data Quality Issues via Crowdsourcing: A DBpedia Study', Semantic Web Journal.
  19. 19. Shape Expressions
 Shape Declaration:
 :User {
 schema:name xsd:string ;
 schema:birthDate xsd:date? ;
 schema:gender [schema:Male schema:Female] OR xsd:string ;
 schema:knows IRI @:User*
 }
 Shape Validation:
 :alice schema:name "Alice" ; # Passes as a :User
 schema:gender schema:Female ;
 schema:knows :bob .
 :bob schema:gender schema:Male ; # Passes as a :User
 schema:name "Robert" ;
 schema:birthDate "1980-03-10"^^xsd:date .
 :dave schema:name "Dave" ; # Fails as a :User
 schema:gender "XYY" ;
 schema:birthDate 1980 . # 1980 is not an xsd:date
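
To make the pass/fail outcomes concrete, the constraints of the `:User` shape can be re-implemented by hand for this one example. The sketch below is not a ShEx engine and uses no ShEx library; the `IRI` marker class, the dict-based graph and the function names are assumptions introduced for illustration.

```python
from datetime import date


class IRI(str):
    """Marks a value as an IRI rather than a string literal."""


GENDER_IRIS = {IRI("schema:Male"), IRI("schema:Female")}

# The three nodes from the slide, as predicate -> list of values.
GRAPH = {
    IRI(":alice"): {"schema:name": ["Alice"],
                    "schema:gender": [IRI("schema:Female")],
                    "schema:knows": [IRI(":bob")]},
    IRI(":bob"): {"schema:gender": [IRI("schema:Male")],
                  "schema:name": ["Robert"],
                  "schema:birthDate": [date(1980, 3, 10)]},
    IRI(":dave"): {"schema:name": ["Dave"],
                   "schema:gender": ["XYY"],
                   "schema:birthDate": [1980]},  # an integer, not a date
}


def is_string_literal(value):
    return isinstance(value, str) and not isinstance(value, IRI)


def conforms_to_user(node_id, graph, seen=None):
    """Check one node against the :User constraints from the slide."""
    seen = set() if seen is None else seen
    if node_id in seen:  # node already under test: break schema:knows cycles
        return True
    seen.add(node_id)
    node = graph.get(node_id, {})
    names = node.get("schema:name", [])
    births = node.get("schema:birthDate", [])
    genders = node.get("schema:gender", [])
    knows = node.get("schema:knows", [])
    return (
        len(names) == 1 and is_string_literal(names[0])              # exactly one name
        and len(births) <= 1 and all(isinstance(b, date) for b in births)  # optional date
        and len(genders) == 1
        and (genders[0] in GENDER_IRIS or is_string_literal(genders[0]))
        and all(isinstance(k, IRI) and conforms_to_user(k, graph, seen)
                for k in knows)                                      # zero or more :User
    )
```

Under these assumptions `:alice` and `:bob` conform, while `:dave` does not. The gender value "XYY" is accepted by the `xsd:string` alternative, so the integer birth date is what causes the failure.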

  20. 20. Data Quality Assessment Improvement
  21. 21. Data Quality Improvement • Root cause analysis • Iterative process • Ensure high data and metadata quality
  22. 22. W3C Data Quality Vocabulary
  23. 23. Poor quality (meta)data hampers research
  24. 24. Lambin et al. Radiother Oncol. 2013. 109(1):159-64. doi: 10.1016/j.radonc.2013.07.007
  25. 25. If we are ever to realize the full potential of the content we create, then we must find ways to reduce the barriers to publishing, finding and reusing that content in a responsible manner
  26. 26. FAIR Principles
  27. 27. Principles to enhance the value of all digital resources: data, images, software, web services, repositories, … Developed and endorsed by researchers, publishers, funding agencies, and industry partners.
  28. 28. Improving the FAIRness of digital resources will increase their quality, their potential for reuse, and the ease of reusing them.
  29. 29. Thank you! Questions?