Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's Metadata

Presentation at 15th International Symposium of Information Science (ISI 2017, isi2017.ib.hu-berlin.de/), Berlin, March 14, 2017

Published in: Data & Analytics

  1. 1. Multilinguality of Metadata. Measuring the Multilingual Degree of Europeana's Metadata. Juliane Stiller (1), Péter Király (2). (1) Berlin School of Library and Information Science, Humboldt-Universität zu Berlin; (2) Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen. ISI 2017, March 14, 2017. Image credit: Languages by eltpics
  2. 2. Agenda 1. Multilinguality in Europeana 2. Multilingual Score for Metadata 3. Implementation 4. Discussion & Future Work
  3. 3. Platform for Cultural Heritage Material www.europeana.eu
  4. 4. Europeana - Facts ○ Books, newspapers, letters, paintings, photographs, radio shows, films, etc. ○ Text, images, video, audio, sounds, 3D ○ Over 54 million objects ○ More than 50 languages http://statistics.europeana.eu/europeana
  5. 5. Thumbnail, Metadata, Link to Provider
  6. 6. Metadata Multilinguality + 40 other languages ...
  7. 7. The Multilingual Problem ○ Mona Lisa: 456 results ○ La Gioconda: 365 results ○ La Joconde: 71 results http://www.europeana.eu/portal/en/record/90402/RP_F_00_351.html
  8. 8. Metadata Enrichment
  9. 9. Quantify the Multilinguality of Data to ○ Take measures to improve multilinguality in data ○ Establish a sense of the multilingual reach of Europeana ○ Distribution of languages ○ Devise strategies for underrepresented languages
  10. 10. Multilingual Score for Metadata
  11. 11. Multilingual saturation of metadata ○ Text without language annotation (dc.subject: Germany) ○ Text with language annotation (dc.subject: Germany@en) ○ Text with several language annotations (dc.subject: Germany@en, Deutschland@de) ○ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany)
  12. 12. Calculation
      ○ Missing field: NA
      ○ Text string without language tag (language not known): 0
      ○ Text string with language tag (language known): 1
      ○ Text string with 2-3 different language tags: 2
      ○ Text string with 4-9 different language tags: 2.3
      ○ Text string with more than 10 different language tags: 2.6
      ○ Link to (multilingual) vocabulary: 3
  13. 13. Example score
      ○ Text without language annotation (dc.subject: Germany): 0
      ○ Text with language annotation (dc.subject: Germany@en): 1
      ○ Text with several language annotations (dc.subject: Germany@en, Deutschland@de): 2
      ○ Link to (multilingual) vocabulary (http://www.geonames.org/2921044/federal-republic-of-germany): 3
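The scale on the Calculation and Example score slides can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the function name, the input representation (a list of language tags, with `None` for an untagged string) and the `links_to_vocabulary` flag are assumptions made for this example. The slide says "more than 10" for the top text band, so treating exactly 10 tags as belonging to that band is also an assumption.

```python
from typing import Optional, Sequence


def saturation_score(
    language_tags: Optional[Sequence[Optional[str]]],
    links_to_vocabulary: bool = False,
) -> Optional[float]:
    """Score one field on the scale from the Calculation slide.

    language_tags holds one entry per text string in the field; None marks
    a string without a language tag. Pass None or an empty sequence for a
    missing field. Both parameters are hypothetical, for illustration only.
    """
    if links_to_vocabulary:
        return 3.0  # link to a (multilingual) vocabulary
    if not language_tags:
        return None  # missing field: NA
    distinct = {tag for tag in language_tags if tag}
    count = len(distinct)
    if count == 0:
        return 0.0  # text without language tag
    if count == 1:
        return 1.0  # text with one known language
    if count <= 3:
        return 2.0  # 2-3 different language tags
    if count <= 9:
        return 2.3  # 4-9 different language tags
    return 2.6  # assumption: 10 or more tags fall into the top text band


# The four cases from the Example score slide.
print(saturation_score([None]))                        # Germany
print(saturation_score(["en"]))                        # Germany@en
print(saturation_score(["en", "de"]))                  # Germany@en, Deutschland@de
print(saturation_score([], links_to_vocabulary=True))  # GeoNames link
```

Returning `None` for a missing field keeps "NA" distinct from a score of 0, so missing fields can be left out of later averages.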
  14. 14. Aggregation of property dc:subject ○ The Wittgenstein Archives at the University of Bergen: high saturation ○ National Library Portugal: low saturation http://144.76.218.178/europeana-qa/saturation.php?collectionId=all&field=proxy_dc_subject&type=average
  15. 15. Good examples
      Descriptive fields
      ○ dc:title: "Die Mauer muß weg!"@de, "Die Mauer muß weg! (The Wall must go!)"@en
      ○ dc:description: "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin"@de, "Annotated images from 1989-1990 in Berlin"@en
      Subject headings
      ○ Place/skos:prefLabel: "Brandenburger Tor"@de, "Brandenburg Gate"@en
      ○ Place/skos:prefLabel: "Grenzübergang Potsdamer Platz"@de, "Potsdamer Platz border crossing"@en
      ○ Place/skos:prefLabel: "Reichstag"@de, "Reichstag building"@en
  16. 16. Implementation ○ Source code: http://pkiraly.github.io/about/#source-codes ○ Data source: http://hdl.handle.net/21.11101/0000-0001-781F-7 (Europeana snapshot, December 2015)
  17. 17. Data processing workflow: ingestion, measuring, statistical analysis, web interface ★ OAI-PMH ★ Europeana API ★ Hadoop ★ NoSQL ★ Spark ★ Hadoop ★ Java ★ Apache Solr ★ Spark ★ R ★ PHP ★ D3.js ★ highchart.js ★ NoSQL. Data formats: json; csv; json, png; html, svg
  18. 18. Visualization
  19. 19. APIs, abstraction, reusing "Place/skos:altLabel": { "instances": [ {"TRANSLATION": 2.0}, {"TRANSLATION": 2.0}, {"TRANSLATION": 2.0}, ... {"TRANSLATION": 2.40}, {"STRING": 0.0} ], "score": { "sum": 20.40, "average": 1.85454545, "normalized": 0.649681 } }
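The `score` object on the slide above can be reproduced with a short aggregation sketch. The slides do not state the normalization formula; the figures shown are consistent with `normalized = average / (1 + average)`, which is an inference from the numbers and should be read as an assumption. The slide elides most instance scores, so the list used below is a hypothetical one chosen only because it matches the slide's sum of 20.40 over 11 instances.

```python
from typing import Dict, Sequence


def aggregate_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-instance scores into sum, average and a normalized value.

    The normalization average / (1 + average) is inferred from the figures
    on the slide, not taken from the authors' code.
    """
    if not scores:
        raise ValueError("at least one instance score is required")
    total = sum(scores)
    average = total / len(scores)
    normalized = average / (1 + average)
    return {"sum": total, "average": average, "normalized": normalized}


# Hypothetical instance list: the slide shows only some of the instances.
example_scores = [2.0] * 9 + [2.4, 0.0]
result = aggregate_scores(example_scores)
print(round(result["sum"], 2), round(result["average"], 8), round(result["normalized"], 6))
```

A mapping of the form x / (1 + x) keeps the normalized value between 0 and 1 for any non-negative average, which is one plausible reason for such a choice.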
  20. 20. Discussion & Future Work
  21. 21. Extension I: recalculation. The new metrics ★ Distinct languages per object ★ Language tags per object ★ Literals per language ★ Number of multilingual properties (a.k.a. fields) ★ Number of multilingual statements (a.k.a. field instances) ★ Average number of languages per property with language ★ Average number of languages per proxy
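Some of the metrics listed above can be sketched as simple counts over an object's literals. This is only one possible reading of the metric names: the function name, the `(property, value, language)` tuple layout and the exact definitions are assumptions for illustration, and the metrics whose definitions the slide leaves open (multilingual properties, multilingual statements, per-proxy averages) are left out.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Literal = Tuple[str, str, Optional[str]]  # (property, value, language tag or None)


def language_metrics(literals: Iterable[Literal]) -> Dict[str, object]:
    """Compute a few per-object language metrics (definitions are assumed)."""
    literals_per_language: Counter = Counter()
    languages_by_property = defaultdict(set)
    for prop, _value, lang in literals:
        if lang:  # untagged literals do not contribute to any metric here
            literals_per_language[lang] += 1
            languages_by_property[prop].add(lang)
    tagged_properties = len(languages_by_property)
    if tagged_properties:
        average = sum(len(s) for s in languages_by_property.values()) / tagged_properties
    else:
        average = 0.0
    return {
        "distinct_languages": len(literals_per_language),
        "language_tags": sum(literals_per_language.values()),
        "literals_per_language": dict(literals_per_language),
        "avg_languages_per_tagged_property": average,
    }


# Literals taken from the "Good examples" slide, plus one untagged subject.
record = [
    ("dc:title", "Die Mauer muß weg!", "de"),
    ("dc:title", "Die Mauer muß weg! (The Wall must go!)", "en"),
    ("dc:description", "Kommentiertes Fotorama mit Bildern von 1989-1990 in Berlin", "de"),
    ("dc:description", "Annotated images from 1989-1990 in Berlin", "en"),
    ("dc:subject", "Germany", None),
]
print(language_metrics(record))
```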
  22. 22. extension II. record views
      ex:providerProxy dc:subject "special relativity"@en ; dc:creator <http://vocab.getty.edu/ulan/500240971> ; dc:type <http://udcdata.info/001684> .
      ex:europeanaProxy dc:subject <http://dbpedia.org/resource/Physics> .
      <http://vocab.getty.edu/ulan/500240971> skos:prefLabel "Einstein, Albert"@de . (standard vocabulary)
      <http://dbpedia.org/resource/Physics> skos:prefLabel "Physics"@en . (standard vocabulary)
      <http://udcdata.info/001684> skos:prefLabel "Books in general"@en . (non-standard vocabulary)
  23. 23. extension II. record views
      source | field | link | value | views
      ex:providerProxy | dc:subject | literal | "special relativity"@en | ① ② ③ ④
      ex:providerProxy | dc:creator | standard | "Einstein, Albert"@de | ① ② ③ ④
      ex:providerProxy | dc:type | non-std | "Books in general"@en | ② ④
      ex:europeanaProxy | dc:subject | standard | "Physics"@en | ③ ④
      ① data provider's proxy and standard enrichments
      ② data provider's proxy and enrichments
      ③ all proxies and standard enrichments
      ④ all proxies and enrichments
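The four record views can be read as two independent switches: whether only the data provider's proxy or all proxies are considered, and whether only standard-vocabulary enrichments or all enrichments are included. The sketch below encodes that reading; the `Statement` class, the `VIEWS` table and the `in_view` function are hypothetical names for this example, and treating literals as always included is an assumption that matches the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    source: str  # "ex:providerProxy" or "ex:europeanaProxy"
    field: str
    link: str    # "literal", "standard" or "non-std"
    value: str


# view number -> (include all proxies?, include all enrichments?)
VIEWS = {
    1: (False, False),  # data provider's proxy and standard enrichments
    2: (False, True),   # data provider's proxy and enrichments
    3: (True, False),   # all proxies and standard enrichments
    4: (True, True),    # all proxies and enrichments
}


def in_view(statement: Statement, view: int) -> bool:
    """Return True if the statement is part of the given record view."""
    all_proxies, all_enrichments = VIEWS[view]
    source_ok = all_proxies or statement.source == "ex:providerProxy"
    link_ok = all_enrichments or statement.link in ("literal", "standard")
    return source_ok and link_ok


# The four statements from the record views slide.
statements = [
    Statement("ex:providerProxy", "dc:subject", "literal", '"special relativity"@en'),
    Statement("ex:providerProxy", "dc:creator", "standard", '"Einstein, Albert"@de'),
    Statement("ex:providerProxy", "dc:type", "non-std", '"Books in general"@en'),
    Statement("ex:europeanaProxy", "dc:subject", "standard", '"Physics"@en'),
]
for s in statements:
    print(s.field, [v for v in VIEWS if in_view(s, v)])
```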
  24. 24. Questions ○ Contact: juliane.stiller@ibi.hu-berlin.de, peter.kiraly@gwdg.de ○ Metadata Quality Assurance Framework http://144.76.218.178/europeana-qa ○ Europeana Data Quality Committee http://pro.europeana.eu/page/data-quality-committee
  25. 25. Appendix: Europeana data structure in 30 sec ○ provider proxy ○ Europeana proxy ○ Agent ○ Concept ○ Place ○ Timespan ○ descriptive fields ○ subject headings ○ semantic web
