
Data Quality Assessment in Europeana: Metrics for Multilinguality



This is the presentation from the (meta)-data quality workshop at TPDL 2017 in Thessaloniki, Greece. Europeana aggregates metadata describing more than 55 million cultural heritage objects from libraries, museums, archives and audiovisual archives across Europe. The quality of this metadata, and its multilingual quality in particular, is crucial for enabling search and data re-use in digital libraries. Capturing the multilingual aspects of the data requires taking into account the full lifecycle of data aggregation, including data enhancement processes such as data enrichment. Multilinguality cannot be captured as a single measure; it needs to be considered as an intersection of several measures, which brings challenges when interpreting and visualising the results. This presentation describes an approach for capturing multilinguality as part of three data quality dimensions: completeness, consistency and accessibility.

Published in: Data & Analytics

  1. Data Quality Assessment in Europeana: Metrics for Multilinguality
     Valentine Charles (1), Juliane Stiller (2), Péter Király (3), Werner Bailer (4), Nuno Freire (5)
     (1) Europeana Foundation, The Hague
     (2) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
     (3) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
     (4) Joanneum Research Forschungsgesellschaft mbH, Graz
     (5) INESC-ID, Lisbon
     TPDL 2017 (meta)-data quality workshop, Thessaloniki, September 21, 2017
     Image: “Measuring tape” by Therese Banström (CC BY-NC 2.0)
  2. Agenda
     1. Europeana
     2. Multilinguality of Metadata and Functional Requirements
     3. Multilinguality as a Facet of Quality Dimensions
     4. Measuring Multilingual Metadata Quality
     5. First Results
     6. Discussion & Future Work
  3. Europeana, Platform for Cultural Heritage Material
  4. Europeana - Facts
     ○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
     ○ Text, images, video, audio, sounds, 3D
     ○ Over 53 million objects
     ○ > 50 languages
  5. Thumbnail; Metadata; Link to Provider
  6. Multilinguality of Metadata & Functional Requirements
  7. Metadata Multilinguality: + 40 other languages...
  8. Multilingual Entities
  9. Quantify Multilinguality of Data to:
     ○ Establish a sense of the multilingual reach of Europeana, incl. distribution of languages
     ○ Identify the impact of different workflows / processes on multilinguality of data
     ○ Take measures to improve multilinguality in data
     ○ Devise strategies for underrepresented languages
  10. What Could be Measured?
     ○ Number of (distinct) languages in the metadata
     ○ Number of language-tagged literals
     ○ Tagged literals per language
     ○ Existence of language information fields such as dc:language
     ○ Consistency of language information
     Requirement: language annotations / tags!
  11. Multilingual Information
     Provider data (a literal, and a literal with language tag):
       <#record> a ore:Proxy ;
         dc:subject "Ballet", "Opera"@en .
     Europeana enrichment:
       <#record> a ore:Proxy ;
         edm:europeanaProxy true ;
         dc:subject <> .
       <> a skos:Concept ;
         skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
           "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
  12. Processes Contributing to Multilinguality
     Data from Provider:
       dc:subject “subject”@en
       dc:creator < aPersonNumber>
       dc:type http://vocab.example/dom ain-spcific
       dc:subject < bjectID>
       dc:subject “Subject”
     Data added by Europeana (dereferencing step):
       dc:creator: new labels in different languages
       dc:subject: new labels in different languages
     Diagram label: Quantifiable
  13. Functional Requirements for Multilingual Services
     ○ Cross-lingual search
     ○ Language-based facets
     ○ Entity-based facets
  14. Multilinguality as a Facet of Quality Dimensions
  15. Completeness
     ○ Expresses the number (fraction) of fields present in a dataset
     ○ Identifies non-empty values in a record or (sub-)collection
     ○ Problem: data model with optional fields
     ○ Multilingual completeness:
       ○ Does the dc:language field have a value?
       ○ Share of fields with language tags relative to all available fields
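The two multilingual completeness checks on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the record layout (a dict mapping field names to lists of `(text, language_tag)` pairs) and the function names are assumptions made for the example, and a field is counted as multilingual when at least one of its values carries a language tag.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: field name -> list of (text, language tag or None).
Literal = Tuple[str, Optional[str]]
Record = Dict[str, List[Literal]]


def has_language_field(record: Record) -> bool:
    """True if dc:language is present with at least one non-empty value."""
    return any(text.strip() for text, _ in record.get("dc:language", []))


def multilingual_field_share(record: Record) -> float:
    """Share of non-empty fields that have at least one language-tagged value."""
    # Keep only fields that contain at least one non-empty literal.
    non_empty = {
        name: values
        for name, values in record.items()
        if any(text.strip() for text, _ in values)
    }
    if not non_empty:
        return 0.0
    tagged = sum(
        1 for values in non_empty.values() if any(lang for _, lang in values)
    )
    return tagged / len(non_empty)


# Made-up example record, not taken from the Europeana dataset.
record: Record = {
    "dc:title": [("Swan Lake", "en")],
    "dc:subject": [("Ballet", None), ("Opera", "en")],
    "dc:language": [("en", None)],
}

print(has_language_field(record))        # True
print(multilingual_field_share(record))  # 2 of 3 fields carry a language tag
```

Counting per field rather than per literal follows the wording of the slide; counting per literal would be an equally plausible reading and gives a different number for the same record.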
  16. Consistency
     ○ Logical coherence of metadata
     ○ Variety of language values in the dc:language field
  17. Accessibility
     ○ Access to information and data across languages
     ○ Distribution of linguistic information in metadata
     ○ Quantifying the language tag
     ○ On record level, collection level, Europeana level or across fields, e.g. how multilingual is the dc:subject field
  18. Dimensions, Criteria & Measures
     Completeness
       Criteria: Presence or absence of values in fields relating to the language of the object or the metadata
       Measures: Share of multilingual fields to overall fields; Presence or absence of dc:language field
     Consistency
       Criteria: Variance in language notation
       Measures: Distinct language notations
     Accessibility
       Criteria: Accessibility across languages expressed through language tags
       Measures (in language tags): Number of distinct languages; Number of languages/Number of tagged literals; Number of tagged literals per language
  19. Measuring Multilingual Metadata Quality
  20. Implementation
     ○ Source code:
     ○ Data source: (Europeana snapshot, December 2015)
     ○ Access to the project: qa/multilinguality.php?id=all
  21. Data processing workflow
     Stages: ingestion, measuring, statistical analysis, web interface
     Tools: ★ OAI-PMH ★ Europeana API ★ Hadoop ★ NoSQL ★ Spark ★ Hadoop ★ Java ★ Apache Solr ★ Spark ★ R ★ PHP ★ D3.js ★ highchart.js ★ NoSQL
     Data formats: json; csv; json, png; html, svg
  22. Visualization
  23. First Results
  24. Completeness
     ○ 904 (out of 3,548) collections have no value in the dc:language field, which shows the field is missing.
     ○ On a record level, 58.03% of the records have a dc:language field.
     ○ Misuse of fields:
       ○ collections that have metadata fields with more than 3 instances of dc:language
       ○ duplication of the language tag
  25. Consistency
     Total values in the Europeana dataset: 33,070,941
     Total values in ISO-639-1: 31,803,048 (96.17%)
     Total values non-normalized: 1,267,893 (3.83%)
     Error rate of the normalization (approx.): 1 / 212,766
     Example: the dc:language values eng, en and en_GB are normalized to en (ISO-639-1, 2-letter codes)
     9,436,280 values needed normalization to ISO-639-1
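The normalization shown on this slide (eng, en and en_GB all becoming en) could look roughly like the Python sketch below. The lookup table is a tiny illustrative subset written for this example, and `normalize_language` and `normalized_share` are hypothetical helper names; the slides do not describe how the project actually performs the normalization.

```python
import re
from typing import Iterable, Optional

# Illustrative subset only; a real table would cover all ISO 639 codes.
ISO_639_2_TO_1 = {"eng": "en", "deu": "de", "ger": "de", "fra": "fr", "fre": "fr"}
ISO_639_1 = set(ISO_639_2_TO_1.values())


def normalize_language(value: str) -> Optional[str]:
    """Map a raw dc:language value to a 2-letter code, or None if unknown."""
    # Keep only the primary subtag, so "en_GB" and "en-GB" both become "en".
    primary = re.split(r"[-_]", value.strip().lower())[0]
    if primary in ISO_639_1:
        return primary
    return ISO_639_2_TO_1.get(primary)


def normalized_share(values: Iterable[str]) -> float:
    """Fraction of values that could be mapped to a 2-letter code."""
    values = list(values)
    if not values:
        return 0.0
    ok = sum(1 for v in values if normalize_language(v) is not None)
    return ok / len(values)


raw_values = ["eng", "en", "en_GB", "English"]
print([normalize_language(v) for v in raw_values])  # ['en', 'en', 'en', None]
print(normalized_share(raw_values))                 # 0.75
```

Values that cannot be mapped, such as the free-text "English" here, are what the slide counts as non-normalized.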
  26. Record level - Accessibility
     First proxy (provider data): 0 distinct languages, 0 tagged literals
       <#record> a ore:Proxy ;
         dc:subject "Ballet", "Opera" .
     Second proxy (Europeana enrichment): 11 distinct languages, 19 tagged literals, 1.7 literals per language
       <#record> a ore:Proxy ;
         edm:europeanaProxy true ;
         dc:subject <> , <> .
       <> a skos:Concept ;
         skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru,
           "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
       <> skos:prefLabel "Opera"@no, "ओपेरा (गीतिनाटक)"@hi, "Oper"@de, "Ooppera"@fi,
           "Опера"@be, "Опера"@ru, "Ópera"@pt, "Опера"@bg, "Opera"@lt .
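The three accessibility numbers on this slide can be reproduced with a short Python sketch. The language tags below are copied from the slide's two sets of skos:prefLabel values; the function name `accessibility_measures` and the `(text, language_tag)` tuple layout are assumptions made for the example, not the project's code.

```python
from typing import List, Optional, Tuple

# Hypothetical literal layout: (text, language tag or None).
Literal = Tuple[str, Optional[str]]


def accessibility_measures(literals: List[Literal]) -> Tuple[int, int, float]:
    """Return (distinct languages, tagged literals, tagged literals per language)."""
    tags = [lang for _, lang in literals if lang]
    distinct = len(set(tags))
    per_language = len(tags) / distinct if distinct else 0.0
    return distinct, len(tags), per_language


# Provider proxy: two literals without language tags.
provider: List[Literal] = [("Ballet", None), ("Opera", None)]

# Europeana proxy: language tags of the labels listed on the slide.
ballet_tags = "no hi de be ru pt bg lt hr lv".split()
opera_tags = "no hi de fi be ru pt bg lt".split()
europeana: List[Literal] = (
    [("ballet label", tag) for tag in ballet_tags]
    + [("opera label", tag) for tag in opera_tags]
)

print(accessibility_measures(provider))  # (0, 0, 0.0)
distinct, tagged, ratio = accessibility_measures(europeana)
print(distinct, tagged, round(ratio, 1))  # 11 19 1.7
```

The ratio is 19 / 11, which rounds to the 1.7 literals per language reported on the slide.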
  27. Discussion & Future Work
  28. Discussion
     ○ Completeness and Consistency are fairly easy to interpret
     ○ Accessibility measures are harder to interpret, e.g. contextual entities for broad concepts or common places often have more translations than lesser-known ones
     ○ Applying quality dimensions is tricky, e.g. technical accessibility vs. accessibility across languages
     ○ No common understanding of quality dimensions
  29. Future Work
     ○ Conceptualize the quality dimensions for multilinguality
     ○ Work on implementing visualizations that are straightforward for providers
  30. Questions
     ○ Contact
     ○ Metadata Quality Assurance Framework
     ○ Europeana Data Quality Committee