Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Improving data quality at Europeana - SWIB16

783 views

Published on

Improving data quality at Europeana - SWIB16

Published in: Software
  • If we are speaking about saving time and money this site ⇒ www.WritePaper.info ⇐ is going to be the best option!! I personally used lots of times and remain highly satisfied.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Sie können Hilfe bekommen bei ⇒ www.WritersHilfe.com ⇐. Erfolg und Grüße!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Yes you are right. There are many research paper writing services available now. But almost services are fake and illegal. Only a genuine service will treat their customer with quality research papers. ⇒ www.HelpWriting.net ⇐
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Be the first to like this

Improving data quality at Europeana - SWIB16

  1. 1. Improving data quality at Europeana New requirements and methods for better measuring metadata quality Péter Király1, Hugo Manguinhas2, Valentine Charles2, Antoine Isaac2, Timothy Hill2 1Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen 2Europeana Foundation, The Netherlands
  2. 2. Improving data quality at Europeana. The data workflow 2 data transformations Europeana Data Model (EDM) Dublin Core, LIDO, EAD, MARC, EDM custom, ...
  3. 3. Improving data quality at Europeana. The problem 3 there are “good” and “bad” metadata records but we don’t have clear metrics like this: functional requirements goodacceptablebad
  4. 4. Improving data quality at Europeana. Non-informative values 4 non informative dc:title: “photograph, framed”, “group photograph” “photograph” informative dc:title: “Photograph of Sir Dugald Clerk”, “Photograph of "Puffing Billy"”
  5. 5. Improving data quality at Europeana. Copy & paste cataloging 5 from a template? more examples in Report and Recommendations from the Task Force on Metadata Quality (2015)
  6. 6. Improving data quality at Europeana. Why data quality is important? 6 “Fitness for purpose” (QA principle) no metadata no access to data no data usage more explanation: Data on the Web Best Practices W3C Working Draft, https://www.w3.org/TR/dwbp/
  7. 7. Improving data quality at Europeana. Data Quality Committee 7
  8. 8. Improving data quality at Europeana. Hypothesis 8 by measuring structural elements we can predict metadata record quality ≃ metadata smell
  9. 9. Improving data quality at Europeana. Purposes 9 ▪ improve the metadata ▪ services: good data → reliable functions ▪ better metadata schema & documentation ▪ propagate “good practice”
  10. 10. Improving data quality at Europeana. What to measure? 10 ▪ Structural and semantic features Cardinality, uniqueness, length, dictionary entry, data type conformance, multilinguality (schema-independent measurements) ▪ Discovery scenarios Requirements of the most important functions ▪ Problem catalog Known metadata problems
  11. 11. Improving data quality at Europeana. Discovery scenarios 11 ▪ Basic retrieval with high precision and recall ▪ Cross-language recall ▪ Entity-based facets ▪ Date-based facets ▪ Improved language facets ▪ Browse by subjects and resource types ▪ Browse by agents ▪ Hierarchical search and facets ▪ ... themostimportantfunctions
  12. 12. Improving data quality at Europeana. Metadata requirements 12 As a user I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc. Metadata analysis In each case the underlying requirement is that the relevant EDM fields for objects be populated with URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages. Measurement rules ▪ the relevant field values should be resolvable URI ▪ each URI should be associated with labels in multiple languages
  13. 13. Improving data quality at Europeana. Problem catalog 13 ▪ Title contents same as description contents ▪ Systematic use of the same title ▪ Bad string: “empty” (and variants) ▪ Shelfmarks and other identifiers in fields ▪ Creator not an agent name ▪ Absurd geographical location ▪ Subject field used as description field ▪ Unicode U+FFFD (�) ▪ Very short description field ▪ ... “metadataanti-patterns”
  14. 14. Improving data quality at Europeana. Problem definition 14 Description Title contents same as description contents Example /2023702/35D943DF60D779EC9EF31F5DF... Motivation Distorts search weightings Checking Method Field comparison Notes Record display: creator concatenated onto title Metadata ScenarioBasic Retrieval
  15. 15. Improving data quality at Europeana. Measurement 15 overall view collection view record view Completeness – 40 measurements Field cardinality – 127 measurements Uniqueness – 6 measurements Multilinguality – 300+ measurements Language specification – 127 measurements Problem catalog – 3 measurements etc. links measurementsaggregated numbers
  16. 16. Improving data quality at Europeana. Field frequency per collections 16 no record has alternative title every record has alternative title filters
  17. 17. Improving data quality at Europeana. Details of field cardinality 17 128 subjects in one record median is 0, mean is close to 1 link to interesting records
  18. 18. Improving data quality at Europeana. Multilinguality 18 @resource is a URI @ = language notation in RDF no language specification
  19. 19. Improving data quality at Europeana. Language frequency 19 has language specification has no language specification
  20. 20. Improving data quality at Europeana. Encoding problems 20 same language, different encodings
  21. 21. Improving data quality at Europeana. Multilingual saturation 21 Levels of Multilinguality per field Expressed in numbers Missing field NA Text string without language tag 0 Text string with language tag 1 Text string with 2-3 different language tags 2 Text string with 4-9 different language tags 2.3 Text string with 10+ different language tags 2.6 Link to controlled vocabulary 3 Penalty for strings mixed with translations with no language tag -0.2
  22. 22. Improving data quality at Europeana. Multilingual saturation 22
  23. 23. Improving data quality at Europeana. Information content 23 1 means a unique term 0.0000x means a very frequent term These are cumulative numbers entropycumulative = term1 + ... + termn
  24. 24. Improving data quality at Europeana. Outliers 24 bulk of records are close to zero although 25% are between 0.05 and 1.25
  25. 25. Improving data quality at Europeana. Architecture 25 Apache Spark OAI-PMH client (PHP) Analysis with Spark (Scala) Analysis with R Web interface (php, d3.js) Hadoop File System JSON files Apache Solr NoSQL datastore JSON files JSON files image files CSV files CSV files recent workflow planned workflow
  26. 26. Improving data quality at Europeana. Further steps 26 ▪ Translate the results into documentation, recommendations ▪ Communication with data providers ▪ Human evaluation of metadata quality ▪ Cooperation with other projects ▪ Incorporating into Europeana’s new ingestion tool ▪ Shape Constraint Language (SHACL) for defining patterns ▪ Process usage statistics ▪ Measuring changes of scores ▪ Machine learning based classification & clustering human analysis technical
  27. 27. Improving data quality at Europeana. Links 27 ▪ Europeana Data Quality Committee: http://pro.europeana.eu/europeana-tech/data-quality- committee ▪ site: http://144.76.218.178/europeana-qa/ ▪ codes: http://pkiraly.github.io/about/#source-codes

×