
Evaluating Data Quality in Europeana: Metrics for Multilinguality (MTSR 2018)



  1. Evaluating Data Quality in Europeana: Metrics for Multilinguality
     Péter Király¹, Juliane Stiller², Valentine Charles³, Werner Bailer⁴, Nuno Freire⁵
     ¹ Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
     ² Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
     ³ Europeana Foundation, The Hague
     ⁴ Joanneum Research Forschungsgesellschaft mbH, Graz
     ⁵ INESC-ID, Lisbon
     MTSR 2018 - Track on Cultural Collections and Applications, Limassol, Oct. 24, 2018
     Nummertjes by Fabio (CC BY-NC 2.0)
  2. Agenda
     1. Europeana
     2. Multilingual Information in Europeana’s Metadata
     3. Multilinguality as a Facet of Quality Dimensions
     4. Results
     5. Demo
  3. Europeana - Platform for Cultural Heritage Material
  4. Europeana - Facts
     ○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc.
     ○ Text, images, video, audio, sounds, 3D
     ○ Over 58 million objects
     ○ > 50 languages
  5. Multilingual Information in Europeana’s Metadata
  6. English cultural heritage object: <dc:language>en</dc:language>
  7. English cultural heritage object: <dc:language>en</dc:language>; German metadata
  8. Multilinguality on Field Level
     Literal, literal with language tag:
       <#record> a ore:Proxy ; dc:subject “Ballet”, “Opera”@en
     Europeana Dereferencing:
       <#record> a ore:Proxy ; edm:europeanaProxy true ; dc:subject <>.
       <> a skos:Concept . skos:prefLabel "Ballett"@no, "बैले"@hi, "Ballett"@de, "Балет"@be, "Балет"@ru, "Balé"@pt, "Балет"@bg, "Baletas"@lt, "Balet"@hr, "Balets"@lv .
  9. Processes Contributing to Multilinguality
     ○ Data from Provider: dc:subject “subject”@en; dc:creator < >; dc:type <http://voc.example./…>; dc:subject < aSubjectID>; dc:subject “Subject”
     ○ Data added by Europeana (dereferencing step): new labels in different languages for dc:creator and dc:subject
     ○ Quantifiable: “term”@language annotation
  10. Quantify Multilinguality of Data to:
      ○ Establish a sense of the multilingual reach of Europeana, incl. distribution of languages
      ○ Identify the impact of different workflows / processes on multilinguality of data
      ○ Take measures to improve multilinguality in data
      ○ Devise strategies for underrepresented languages
  11. What Could be Measured?
      ○ Number of (distinct) languages in the metadata
      ○ Number of language-tagged literals
      ○ Tagged literals per language
      ○ Existence of language information fields such as dc:language
      ○ Consistency and conformity of language information
  12. Multilinguality as a Facet of Quality Dimensions
  13. Completeness
      ○ This dimension:
        ○ expresses the number (fraction) of fields present in a dataset
        ○ identifies non-empty values in a record or (sub-)collection
      ○ Multilingual completeness is captured by:
        ○ Presence of a value in dc:language
        ○ Share of fields with language tags among all available fields
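The two completeness measures on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the record layout (a dict from field name to a list of `(value, language tag)` pairs), the function names and the sample record are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: field name -> list of (value, language tag or None).
Record = Dict[str, List[Tuple[str, Optional[str]]]]


def has_dc_language(record: Record) -> bool:
    """True if dc:language is present with at least one non-empty value."""
    return any(value.strip() for value, _ in record.get("dc:language", []))


def tagged_field_share(record: Record) -> float:
    """Share of non-empty fields that carry at least one language-tagged literal."""
    non_empty = [
        values for values in record.values()
        if any(value.strip() for value, _ in values)
    ]
    if not non_empty:
        return 0.0
    tagged = sum(1 for values in non_empty if any(lang for _, lang in values))
    return tagged / len(non_empty)


# Invented sample record, for illustration only.
record: Record = {
    "dc:title": [("Swan Lake", "en")],
    "dc:subject": [("Ballet", None), ("Opera", "en")],
    "dc:creator": [("Unknown", None)],
    "dc:language": [("en", None)],
}

print(has_dc_language(record))     # dc:language is filled in
print(tagged_field_share(record))  # 2 of the 4 non-empty fields have a tagged literal
```

Counting a field as tagged when any one of its literals has a tag is a design choice of this sketch; a stricter variant could require every literal in the field to be tagged.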
  14. Consistency
      ○ Describes the logical coherence of metadata
      ○ Assesses variety of language values in the dc:language field: how many distinct values?
      ○ Contributes to features like language-based facet
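Counting distinct dc:language notations could look roughly like the sketch below. The function name and the sample values are assumptions for illustration; the values are compared exactly as written (after trimming whitespace), so that variant spellings of the same language show up as separate notations.

```python
from collections import Counter
from typing import Iterable


def language_notations(values: Iterable[str]) -> Counter:
    """Count each distinct, whitespace-trimmed dc:language value as written."""
    return Counter(v.strip() for v in values if v.strip())


# Invented sample of dc:language values collected from several records.
values = ["en", "English", "ENG", "en", "de", " en-uk "]

counts = language_notations(values)
distinct = len(counts)

print(distinct)               # number of distinct notations in the sample
print(counts.most_common(3))  # the most frequent notations
```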
  15. Conformity
      ○ Describes the conformity to a given standard such as ISO 639-2
      ○ Example: English is expressed as: English, ENG, en, en-uk, …
      ○ Share of values that comply or do not comply
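A conformity share can be computed by checking each value against a list of valid codes. The sketch below makes several assumptions: the code set is a tiny illustrative subset of ISO 639-2 rather than the full list, matching is exact and case-sensitive (so `ENG` counts as non-compliant, which is a policy choice), and the function name and sample values are invented.

```python
from typing import Iterable, Set

# Illustrative subset only; a real check would load the full ISO 639-2 code list.
ISO_639_2_SUBSET: Set[str] = {"eng", "deu", "ger", "fra", "fre", "por", "rus"}


def conformity_share(values: Iterable[str],
                     valid_codes: Set[str] = ISO_639_2_SUBSET) -> float:
    """Share of non-empty values that exactly match a code in valid_codes."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return 0.0
    compliant = sum(1 for v in cleaned if v in valid_codes)
    return compliant / len(cleaned)


# Invented sample mixing compliant codes with the variants named on the slide.
sample = ["English", "ENG", "en", "en-uk", "eng", "deu", "por", "rus"]

print(conformity_share(sample))  # 4 of the 8 values are in the subset
```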
  16. Accessibility
      ○ Access to information and data across languages
      ○ Distribution of linguistic information in metadata
      ○ Quantifying the language tag
      ○ The more language tags, the higher the multilingual reach
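The language-tag counts behind the accessibility dimension can be sketched as follows. The input format (a list of `(value, language tag)` pairs) and the function name are assumptions; the sample literals are taken from the dc:subject example earlier in the deck.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, Optional[str]]


def accessibility_measures(literals: List[Literal]) -> Dict[str, object]:
    """Count tagged literals, distinct languages, and tagged literals per language."""
    tags = [lang.lower() for _, lang in literals if lang]
    per_language = Counter(tags)
    return {
        "tagged_literals": len(tags),
        "distinct_languages": len(per_language),
        "literals_per_language": dict(per_language),
    }


# Literals from the deck's dc:subject example: one untagged, the rest tagged.
literals: List[Literal] = [
    ("Ballet", None),
    ("Opera", "en"),
    ("Ballett", "no"),
    ("Ballett", "de"),
    ("Balé", "pt"),
]

measures = accessibility_measures(literals)
print(measures["tagged_literals"], measures["distinct_languages"])
```

Lower-casing the tags is a simplification so that `EN` and `en` count as one language; it does not merge regional variants such as `en-uk` into `en`.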
  17. Dimensions, Criteria & Measures

      | Dimension     | Criteria                                                                                        | Measure                                                                                                    |
      |---------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
      | Completeness  | Presence or absence of values in fields relating to the language of the object or the metadata | Share of multilingual fields to overall fields; presence or absence of dc:language field                   |
      | Consistency   | Variance in language notation                                                                   | Distinct language notations                                                                                |
      | Conformity    | Compliance to ISO 639-2                                                                         | Share of values that comply                                                                                |
      | Accessibility | Accessibility across languages expressed through language tags                                  | Number of distinct languages; number of languages/number of tagged literals; tagged literals per language |
  18. Results
  19. Data processing workflow
      ○ Stages: ingestion, measuring, statistical analysis, web interface
      ○ Tools listed: OAI-PMH, Europeana API, Hadoop, NoSQL, Spark, Java, Apache Solr, R, PHP, D3.js, highchart.js
      ○ Data formats passed between stages: json; csv; json, png; html, svg
  20. DEMO
  21. Questions
      ★ Contact
      ★ Metadata Quality Assurance Framework
      ★ Europeana Data Quality Committee quality-committee
  22. Discussion & Future Work
  23. Discussion
      ○ Completeness and Consistency are fairly easy to interpret
      ○ Accessibility measures are harder to interpret, e.g. contextual entities for broad concepts or common places often have more translations than lesser-known things
      ○ Applying quality dimensions is tricky, e.g. technical accessibility vs. accessibility across languages
      ○ There is no common understanding of quality dimensions
  24. Future Work
      ○ Embedding into Europeana workflow
      ○ Evaluation of the metrics
  25. Metadata Multilinguality
      + 40 other languages...
