Introduction to data quality management (BVB KVB FDM-KompetenzPool, 2021)



  1. Introduction to data quality management. FDM-Kompetenzpool meeting with exchange and training on “data quality”, 2021-07-06, BibliotheksVerbund Bayern – Kommission Virtuelle Bibliothek. Péter Király (@kiru), Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
  2. the problem
  3. top 20 patterns, ‘date’ field, MoMA collection (Harald Klinke, LMU München)
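
A pattern count like the one on this slide can be sketched in a few lines. This is a minimal illustration, not the method used for the slide: the masking rule (digits become `N`, runs of letters become `a`) and the sample values in `sample_dates` are assumptions made for the example.

```python
import re
from collections import Counter


def to_pattern(value):
    """Mask a raw value: every digit becomes 'N', every run of letters becomes 'a'."""
    masked = "".join(
        "N" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )
    # Collapse runs of masked letters so 'circa' and 'c' yield the same pattern.
    return re.sub(r"a+", "a", masked)


def top_patterns(values, n=20):
    """Return the n most frequent patterns with their counts."""
    return Counter(to_pattern(v) for v in values).most_common(n)


# Hypothetical 'date' values, invented for illustration only.
sample_dates = [
    "1896", "1987", "c. 1917", "1976-77",
    "(1950)", "Unknown", "1903", "c. 1920",
]

for pattern, count in top_patterns(sample_dates):
    print(f"{count:>3}  {pattern}")
```

Sorting patterns by frequency makes the dominant conventions and the rare outliers of a free-text field visible at a glance.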
  4. Generic title and bad thumbnail; more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  5. data quality management lifecycle. Quality assessment: 1. measure records, 2. aggregate, 3. report, 4. evaluate with experts. Diagram labels: catalogue, improve records, explore your data!, data, knowledge about data, remediation plan
  6. quality and ‘fitness for purpose’: ‘We know it when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter.’
  7. metadata quality. Purpose: to access content. No metadata, no access to data, no data usage. More explanation: Data on the Web Best Practices, W3C Working Draft; bad metadata
  8. the problem statement – improved: there are “good” and “bad” metadata records; we would like to achieve metrics like this: functional requirements: good, acceptable, bad
  9. general metrics ★ completeness: number of metadata elements filled out ★ accuracy: data correspond to the resource that is being described ★ consistency: values comply with what is defined by the metadata scheme ★ objectiveness: values describe the resource in an unbiased way ★ appropriateness: values facilitate the deployment of search ★ correctness: syntactically and grammatically correct language. Bruce and Hillmann (2004); Ochoa and Duval (2009); Palavitsinis (2014)
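
The completeness metric above can be made concrete as the share of expected fields that carry a value. The sketch below assumes records are plain dictionaries; the field list `FIELDS` and the sample record are hypothetical, and real schemas usually weight or group fields rather than counting them equally.

```python
def is_filled(value):
    """Treat None, blank strings and empty collections as 'not filled out'."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) > 0
    return True


def completeness(record, fields):
    """Fraction of the expected fields that are filled out in the record."""
    if not fields:
        raise ValueError("at least one field is required")
    filled = sum(1 for field in fields if is_filled(record.get(field)))
    return filled / len(fields)


# Hypothetical schema and record, invented for illustration only.
FIELDS = ["title", "creator", "date", "subject", "rights"]
sample_record = {"title": "Untitled", "creator": "", "date": "1950", "subject": []}

print(completeness(sample_record, FIELDS))
```

Counting an empty string or an empty list as missing matters in practice, because exports often emit the element even when it has no content.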
  10. linked data dimensions and metrics. accessibility: ★ Availability ★ Licensing ★ Interlinking ★ Security ★ Performance; intrinsic: ★ Syntactic validity ★ Semantic accuracy ★ Consistency ★ Conciseness ★ Completeness; contextual: ★ Relevancy ★ Trustworthiness ★ Understandability ★ Timeliness; representational: ★ Representational conciseness ★ Interoperability ★ Interpretability ★ Versatility. Stvilia et al. (2007); Zaveri et al. (2015)
  11. FAIR metrics. The good metrics are ★ clear ★ realistic ★ discriminating ★ measurable ★ universal. Tool: F-UJI (FAIRsFAIR Research Data Object Assessment Service)
  12. F1 – Identifier Uniqueness. What is being measured? Whether there is a scheme to uniquely identify the digital resource. How do we measure it? An identifier scheme is valid if and only if it is described in a repository that can register and present such identifier schemes (e.g.
  13. RDFUnit, SHACL and ShEx ★ Linked Data is based on the Open World assumption ★ No “record”, no clear boundaries ★ RDF Data Shapes: reinventing the schema ★ ShEx (Shape Expressions) and SHACL (Shapes Constraint Language) ★ Finding individual data issues
  14. Core constraints. Cardinality: minCount, maxCount. Types of values: class, datatype, nodeKind. Shapes: node, property, in, hasValue. Range of values: minInclusive, maxInclusive, minExclusive, maxExclusive. String based: minLength, maxLength, pattern, stem, uniqueLang. Logical constraints: not, and, or, xone. Closed shapes: closed, ignoredProperties. Property pair constraints: equals, disjoint, lessThan, lessThanOrEquals. Non-validating constraints: name, value, defaultValue. Qualified shapes: qualifiedValueShape, qualifiedMinCount, qualifiedMaxCount
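
To show what a few of these constraints check, here is a plain-Python sketch covering only `minCount`, `maxCount`, `minLength` and `pattern`. It is not a SHACL engine and does not operate on RDF: the dictionary-based `shape`, the `validate` helper and the sample records are assumptions made for illustration.

```python
import re


def validate(record, shape):
    """Return (property, constraint) pairs for every violated constraint."""
    violations = []
    for prop, constraints in shape.items():
        values = record.get(prop, [])
        # Cardinality constraints apply to the number of values.
        if "minCount" in constraints and len(values) < constraints["minCount"]:
            violations.append((prop, "minCount"))
        if "maxCount" in constraints and len(values) > constraints["maxCount"]:
            violations.append((prop, "maxCount"))
        # String-based constraints apply to each value individually.
        for value in values:
            if "minLength" in constraints and len(value) < constraints["minLength"]:
                violations.append((prop, "minLength"))
            if "pattern" in constraints and not re.search(constraints["pattern"], value):
                violations.append((prop, "pattern"))
    return violations


# Hypothetical shape and records, invented for illustration only.
shape = {
    "title": {"minCount": 1, "maxCount": 1, "minLength": 3},
    "date": {"maxCount": 1, "pattern": r"^\d{4}$"},
}
good_record = {"title": ["Untitled landscape"], "date": ["1950"]}
bad_record = {"title": ["A"], "date": ["c. 1950", "1951"]}

print(validate(good_record, shape))
print(validate(bad_record, shape))
```

Each violation names both the property and the constraint, which is the kind of per-issue detail that supports the "finding individual data issues" use case.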
  15. The Quartz guide to bad data (2015) ★ by Christopher Groskopf ★ guide for data journalists on how to recognize data issues ★ practical guide, not an academic paper ★ take-away messages: ○ be sceptical about the data ○ check it with exploratory data analysis ○ check it early, check it often ★, guide-to-bad-data/
  16. where and who should solve issues? ★ Issues that your source should solve ○ Values are missing ○ Zeros replace missing values ★ Issues that you should solve ○ Sample is biased ○ Data has been manually edited ★ Issues a third-party expert should help you solve ○ Author is untrustworthy ○ Collection process is opaque ★ Issues a programmer should help you solve ○ Data are aggregated to the wrong categories or geographies ○ Data are in scanned documents
  17.
  18. in practice, part II
  19. hypothesis: by measuring structural elements we can approximate metadata record quality (≃ metadata smell)
  20. organisational proposal: Europeana* Data Quality Committee ★ Analysing/revising metadata schema ★ Functional requirement analysis ★ Problem catalog ★ Multilinguality (* elsewhere: DDB, British Library, Digital Library Federation, DPLA ...)
  21. technical proposal: “Metadata Quality Assessment Framework”, a generic tool for measuring metadata quality ★ adaptable to different metadata schemes ★ scalable (to Big Data) ★ understandable reports for data curators ★ open source
  22. What to measure? ★ Structural and semantic features: completeness, cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (generic metrics) ★ Functional requirement analysis / discovery scenarios: requirements of the most important functions ★ Problem catalog: known metadata problems
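
Two of the structural features listed above, cardinality and uniqueness, can be sketched as follows. The definitions are simplified assumptions for illustration (uniqueness here is the share of value occurrences whose value appears exactly once in the collection), and the sample records are hypothetical; the framework itself may define these metrics differently.

```python
from collections import Counter


def cardinality(record, field):
    """Number of values the record holds for the given field."""
    return len(record.get(field, []))


def uniqueness(records, field):
    """Share of value occurrences whose value appears only once across all records."""
    counts = Counter(value for record in records for value in record.get(field, []))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / total


# Hypothetical records, invented for illustration only.
sample_records = [
    {"title": ["Photograph"]},
    {"title": ["Photograph"]},
    {"title": ["View of Göttingen", "Ansicht von Göttingen"]},
    {"title": ["Untitled"]},
]

print(cardinality(sample_records[2], "title"))
print(uniqueness(sample_records, "title"))
```

A low uniqueness score for a field such as the title points to generic values that do little to help users tell records apart.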
  23. demos ★ Europeana metadata quality dashboard https://rnd- ★ Union catalogue of BibliotheksVerbund Bayern
  24. data quality management lifecycle. Quality assessment: 1. measure records, 2. aggregate, 3. report, 4. evaluate with experts. Diagram labels: catalogue, improve records, explore your data!, data, knowledge about data, remediation plan
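
The first three steps of the quality assessment (measure records, aggregate, report) can be sketched as a small pipeline. The scoring rule (share of non-empty fields), the field list and the sample records are assumptions made for the example; the fourth step, evaluation with experts, is a human activity and is not modelled.

```python
from statistics import mean, median


def measure(record, fields):
    """Step 1: score one record as the share of expected fields that are non-empty."""
    return sum(1 for field in fields if record.get(field)) / len(fields)


def aggregate(scores):
    """Step 2: summarise the per-record scores for the whole collection."""
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }


def report(summary):
    """Step 3: render the summary as plain text for data curators."""
    lines = [f"records: {summary['count']}"]
    lines += [f"{key}: {summary[key]:.2f}" for key in ("mean", "median", "min", "max")]
    return "\n".join(lines)


# Hypothetical schema and records, invented for illustration only.
fields = ["title", "creator", "date", "rights"]
collection = [
    {"title": "A", "creator": "B", "date": "1950", "rights": "CC0"},
    {"title": "A", "date": "1950"},
    {"title": "A", "creator": "B"},
    {},
]

scores = [measure(record, fields) for record in collection]
summary = aggregate(scores)
print(report(summary))
```

Keeping the three steps as separate functions means the record-level measurement can be swapped for another metric without touching aggregation or reporting.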
  25. Let’s cooperate! ★ Király (2019) Measuring metadata quality. 10.13140/RG.2.2.33177.77920 ★ Király–Brase (2021) Qualitätsmanagement. In Praxishandbuch Forschungsdatenmanagement, 10.1515/9783110657807-020