Workshop on Data Quality Management in Wikidata

Opening Keynote at the Workshop on Data Quality Management in Wikidata on "Open Data Quality: dimensions, metrics, assessment and improvement​"

Published in: Education

  1. Open Data Quality: dimensions, metrics, assessment and improvement. Amrapali Zaveri (@amrapaliz). Workshop on Data Quality Management in Wikidata, Jan 18, 2019.
  3. An increasing number of discoveries are made using other people’s data.
  4. Biggest Challenge: Poor Data Quality (figure labels: Garbage, Garbage). Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  5. Data Quality Assessment Dimensions & Metrics
  6. Systematic Literature Review: 30 core articles (Conference: 21, Journal: 8, Masters Thesis: 1); 18 dimensions, 69 metrics.
  7. LDQ Dimensions & Metrics. Quality assessment for linked data: A survey. A Zaveri, A Rula, A Maurino, R Pietrobon, J Lehmann, S Auer. Semantic Web 7 (1), 63-93. 300+ citations.
  8. LDQ Dimensions & Metrics
     • Data Quality: commonly conceived as a multi-dimensional construct, with a popular definition being ‘fitness for use’*.
     • Dimension: a characteristic of a dataset.
     • Metric (or indicator): a procedure for measuring an information quality dimension.
     *Juran et al., The Quality Control Handbook, 1974
  9. LDQ Assessment Goal: fix data quality issues in given sets of (semantic) data. Such quality issues may
     • be in source datasets (e.g., inaccurate or wrong data items, outdated data items)
     • result from imperfections of a data integration process (e.g., data items that have been incorrectly linked with each other)
     • reveal themselves only after the data integration (e.g., duplicates, inconsistencies)
     Data cleaning may be relevant both for original datasets before combining/integrating and for datasets resulting from an integration.
     Source: http://www.ida.liu.se/research/semanticweb/events/SemDataMgmtTutorial-Part7-Cleaning.pdf
  10. 18 LDQ Dimensions
  11. LDQ Dimensions - Accessibility dimensions & metrics
     • Availability - extent to which data (or some portion of it) is present, obtainable and ready for use
        • accessibility of the SPARQL endpoint and the server
        • dereferenceability of the URI
     • Interlinking - degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources
        • detection of the existence and usage of external URIs
        • detection of all local in-links or back-links: all triples from a dataset that have the resource’s URI as the object
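The two interlinking metrics on this slide can be sketched in a few lines of Python. This is only an illustration, not the method of any particular tool: the triples, the `LOCAL_HOST` value and the function names are hypothetical, and triples are modelled as plain string tuples rather than parsed RDF.

```python
from urllib.parse import urlparse

# Hypothetical sample data: (subject, predicate, object) triples as plain strings.
LOCAL_HOST = "example.org"  # assumed host of the local dataset

TRIPLES = [
    ("http://example.org/Berlin",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/Germany",
     "http://example.org/capital",
     "http://example.org/Berlin"),
    ("http://example.org/Berlin",
     "http://example.org/population",
     "3645000"),
]


def is_uri(term):
    """Treat a term as a URI only if it has an http(s) scheme and a host."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def external_links(triples, local_host):
    """Triples whose object is a URI hosted outside the local dataset."""
    return [t for t in triples
            if is_uri(t[2]) and urlparse(t[2]).netloc != local_host]


def in_links(triples, resource):
    """All triples that have the resource's URI as the object (back-links)."""
    return [t for t in triples if t[2] == resource]


print(len(external_links(TRIPLES, LOCAL_HOST)))               # 1
print(len(in_links(TRIPLES, "http://example.org/Berlin")))    # 1
```

Comparing hosts is a deliberately crude notion of "external"; a real assessment would more likely compare against the dataset's declared namespaces.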
  12. LDQ Dimensions - Intrinsic dimensions & metrics
     • Syntactic Validity - degree to which an RDF document conforms to the specification of the serialization format
        • detecting syntax errors using (i) validators, (ii) crowdsourcing
        • checking values by (i) an explicit definition of the allowed values for a datatype, (ii) syntactic rules (types of characters allowed and/or the pattern of literal values)
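A minimal sketch of the rule-based check on literal values, assuming each literal arrives as a (lexical form, datatype) pair. The regular expressions below are simplified stand-ins for the real XSD lexical rules (for example, they ignore time zones on dates), and all names are hypothetical.

```python
import re

# Assumed, simplified lexical patterns for three XSD datatypes (not complete).
PATTERNS = {
    "xsd:integer": re.compile(r"[+-]?\d+"),
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "xsd:boolean": re.compile(r"true|false|1|0"),
}


def is_syntactically_valid(lexical, datatype):
    """Return True/False for known datatypes, None when no rule is defined."""
    pattern = PATTERNS.get(datatype)
    if pattern is None:
        return None
    return pattern.fullmatch(lexical) is not None


def syntactic_validity(literals):
    """Share of checkable literals whose lexical form matches its datatype."""
    results = [is_syntactically_valid(value, dtype) for value, dtype in literals]
    checked = [r for r in results if r is not None]
    return sum(checked) / len(checked) if checked else None


# Hypothetical literals: two well-formed, two malformed.
LITERALS = [
    ("42", "xsd:integer"),
    ("4 2", "xsd:integer"),
    ("2019-01-18", "xsd:date"),
    ("18/01/2019", "xsd:date"),
]

print(syntactic_validity(LITERALS))  # 0.5
```

Literals with a datatype that has no rule are skipped rather than counted as invalid, so the score reflects only what the rules can actually judge.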
  13. LDQ Dimensions - Intrinsic dimensions & metrics
     • Completeness
        • Schema - ontology completeness
        • Property - missing values for a specific property
        • Population - percentage of all real-world objects of a particular type that are represented in the dataset
        • Interlinking - degree to which instances in the dataset are interlinked
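Property completeness is the easiest of these to compute from the data alone. The sketch below assumes triples as plain string tuples with prefixed names; the class `ex:City`, the property `ex:population` and the sample data are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Hypothetical sample data: four cities, one without a population value.
TRIPLES = [
    ("ex:Berlin", RDF_TYPE, "ex:City"),
    ("ex:Paris", RDF_TYPE, "ex:City"),
    ("ex:Rome", RDF_TYPE, "ex:City"),
    ("ex:Oslo", RDF_TYPE, "ex:City"),
    ("ex:Berlin", "ex:population", "3645000"),
    ("ex:Paris", "ex:population", "2148000"),
    ("ex:Rome", "ex:population", "2873000"),
]


def property_completeness(triples, cls, prop):
    """Fraction of instances of `cls` that have at least one value for `prop`."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return None  # nothing to measure
    with_value = {s for s, p, o in triples if p == prop and s in instances}
    return len(with_value) / len(instances)


print(property_completeness(TRIPLES, "ex:City", "ex:population"))  # 0.75
```

Population completeness cannot be computed this way, since it needs an external reference count of the real-world objects of that type.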
  14. Data Quality Assessment Tools
  15. RDFUnit: RDF Unit-Testing Suite. http://aksw.org/Projects/RDFUnit.html (figure labels: Syntactic, Semantic, Consiste)
  16. Crowdsourcing Linked Data Quality Assessment. Crowdsourcing linked data quality assessment. M Acosta, A Zaveri, E Simperl, D Kontokostas, S Auer, J Lehmann. ISWC 2013.
  17. Luzzu: QA for LOD. http://eis-bonn.github.io/Luzzu/index.html (figure labels: 1 Metric, 2 Asses, 3 Clean, 4 Store, 5 Rank)
  18. LDQ Assessment Tools: LODLaundromat. http://lodlaundromat.org/
  19. LDQ Beyond Data: Mapping Quality. Dimou et al. Assessing and Refining Mappings to RDF to Improve Dataset Quality. ISWC 2015. https://github.com/RMLio/RML-Validator
  20. LDQ: ShEx Validation. https://www.w3.org/2013/ShEx/Examples/
  21. Data Quality Assessment Improvement
  22. Data Quality Improvement
     • Root cause analysis
     • Iterative process
     • Ensure high data and metadata quality
  23. W3C Data Quality Vocabulary. https://www.w3.org/TR/vocab-dqv/
  24. Poor quality (meta)data hampers research
  25. 400 AI papers: 6% code, 30% test data, 54% pseudo code. Source: http://science.sciencemag.org/content/359/6377/725
  26. Lambin et al. Radiother Oncol. 2013. 109(1):159-64. doi: 10.1016/j.radonc.2013.07.007
  27. If we are ever to realize the full potential of the content we create, then we must find ways to reduce the barriers to publishing, finding and reusing that content in a responsible manner.
  28. FAIR Principles. http://www.nature.com/articles/sdata201618
  29. Principles to enhance the value of all digital resources: data, images, software, web services, repositories,… Developed and endorsed by researchers, publishers, funding agencies and industry partners.
  30. Improving the FAIRness of digital resources will increase their quality and their potential for, and ease of, reuse.
  31. Thank you! Questions?