
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2, 25 October 2019

  1. 1. Books on a table, Aalto, Ilmari, 1928, National Digital Library (NDL), Finland, CC0 EUROPEANA MEETING UNDER FINLAND’S PRESIDENCY OF THE COUNCIL OF THE EU ESPOO, FINLAND 25 October 2019
  2. 2. Books on a table, Aalto, Ilmari, 1928, National Digital Library (NDL), Finland, CC0 Andy Neale Technical Director Europeana Foundation Recap on main conclusions of Day 1
  3. 3. Content Information Access Interactions User Interface Metadata and digital CH objects Search, Browse & Explore Show user‘s preferred language Bridge the gap between language of user input and content Layers of digital CH system Juliane
  4. 4. Mismatch between query and content language • Mona Lisa: 203 results • Monna Lisa: 13 results • La Gioconda: 376 results • La Joconde: 78 results Interactions Roma, Galleria Corsini - La Gioconda, Juliane
  5. 5. Challenges • Missing training data for small languages • Missing training data for (sub)domains • Number of language pairs is immense with 50+ languages • Metadata is too scarce for good translation results Juliane
  6. 6. Evaluate solution based on goal ○ E.g. for ML retrieval we might not need the perfect fluent translation ○ Identify the impact of different workflows / processes on multilinguality of system ○ Translations do not only have an impact on data but also on retrieval and therefore on user satisfaction Juliane
  7. 7. Challenges for LT in cultural heritage ● Interface or content (= multilingual in a broad sense) ● Far beyond modern standard language use ● Great variation makes domain adaptation hard ● Variation in place (dialects and languages), time (old Swedish) and situation (informal-formal) ● Modal variation in collections: (handwritten) text, speech, pictures ● Hard to handle as researchers want to explore a collection as a whole Rickard
  8. 8. Next steps ● Linked data to describe the collection conceptually and relationally ● Multilingual search methods for handling language variation in place, time and situation ● Domain-adapted speech-to-text conversion to transcribe recordings ● Crowdsourcing for correcting ● Shared resources for the languages, dialects, domains etc ● Long-term funding for the National Language Bank ● Collaborative projects involving LTists, researchers and data holders Rickard
  9. 9. Hugo.lv – AI powered language technology portal Andrejs & Jānis
  10. 10. Conclusions • New generation of Neural MT strongly improves quality and applicability of machine translation, especially for morphology rich languages • Domain specific data is crucial for making MT suitable for cultural and other domains • Depending on the application, translation needs can be served by selecting the most efficient approach – pure MT, human review of the MT, or fully human translation • We will be happy to share our experience, technologies and tools :) Andrejs & Jānis
  11. 11. Development Implementation Operation and maintenance Initiation (of a new service) time Process-time Use-time Future Who are involved in the development and implementation of your service? What kinds of benefits can be identified? Who uses your service? Are there other stakeholders? What kinds of benefits can be identified? Who could (re)use your service or materials in the (undefined) future? What kinds of benefits can be anticipated? Model for temporal division of benefits Kautonen, H. & Nieminen, M. (2018): Conceptualizing Benefits of User-Centered Design for Digital Library Services. Liber Quarterly, 28(1), ss. 1–34. DOI: http://doi.org/10.18352/lq.10231. Heli
  12. 12. Dasha
  13. 13. Dasha
  14. 14. Dasha
  15. 15. Language detection and display (for validation) Query translated in 24 languages Dasha
  16. 16. THE NATIONAL LIBRARY OF FINLAND Thesaurus to ontology ▪ Reconstruction of YSA into machine-readable and multilingual YSO ▪ Trilingual terms for concepts (fin, swe, eng) ▪ YSA and Allärs merged together and translated into English ▪ Concepts are a compromise between Finnish and Swedish as YSA and Allärs are not completely identical ▪ Links to Library of Congress Subject Headings (LCSH) ▪ Linking to Wikidata underway ▪ YSO just made the list of Europeana dereferenceable vocabularies that can be enriched in the Europeana portal Matias
  17. 17. THE NATIONAL LIBRARY OF FINLAND Annotate in one language, find using another Matias
  18. 18. THE NATIONAL LIBRARY OF FINLAND Automated Subject Indexing made easy: Annif ▪ An open source multilingual automated subject indexing system using machine learning and our own vocabularies Matias
  19. 19. Europeana’s Knowledge Graph Entity Collection Hugo
  20. 20. Proposals for indexing and storing translations ● Automated identification of language if needed (only 26.5% of the data provider’s metadata is language qualified) ● Use translations from multilingual knowledge graph ● Augment the provider metadata with static translation of the fields to English (to fill metadata values not covered by the knowledge graph) ● Store and index translated metadata for search and display (original metadata + languages of the knowledge graph + English) Hugo
  21. 21. Proposals for search on object metadata Identify language Original query Translate to English Multilingual index User Disambiguates Search Translated query (English) Suggest Entity (Knowledge Graph) Entity-based query Multilingual query: entity based query OR original query + translated query #1: French #2: Spanish #3: Polish Hugo
  22. 22. Session 4 CONTENT TRANSLATION Europa [Material cartográfico] : Nach den vorzüglichsten Hülfsnitteln, Götze, Johann August Ferdinand, 1773-1819 Biblioteca Digital de Madrid Spain, Public domain
  23. 23. Books on a table, Aalto, Ilmari, 1928, National Digital Library (NDL), Finland, CC0 Tom Vanallemeersch Machine translation specialist CrossLang The art of automating translation
  24. 24. Cultural heritage and translation ● Translation helps to open up cultures ● Rosetta stone was the key to understanding hieroglyphs Parallel data ● Systems for automated translation (now) act in a similar way ● However, the right stones are required, and many of them ...
  25. 25. Context of this talk EC project SMART 2016/0103: ● Identification of language technology needs of Digital Service Infrastructures of EC E.g. Europeana DSI ● Framework: Connecting Europe Facility – Automated Translation ● Contracting authority: DG CNECT (the EC's multilingual enabler) ● Consortium:
  26. 26. Guide to this talk Machine translation (MT): ● In general ● In a highly multilingual environment: eTranslation (EC) ● For EU cultural heritage Challenges: ●Domain imbalance ●Language imbalance ●Context demand ●Multimodal sources Approaches
  27. 27. MT in general ● MT systems are data-driven 🡪 Sentence pairs: They were living there - Ils habitaient là-bas 🡪 Software consisting of a neural network (like many recent AI applications) ● MT is used for various purposes 🡪 Post-editing, gisting, cross-lingual retrieval
  28. 28. MT in general: domain imbalance ● Quality typically improves when increasing training data ● But there are few (accessible) translations in some domains ● The same problem occurs for specific genres (e.g. novels) and registers (e.g. informal language) Difference in amount of domain-specific resources
  29. 29. MT in general: domain imbalance Approach: identify/create domain-specific data ● Select sentence pairs from the vast ParaCrawl Corpus ● Use the ParaCrawl toolkit for multilingual websites, archives ● Select domain-specific parallel corpora from the ELRC-SHARE repository ● Create artificial training data: e.g. apply MT to French in-domain data, then add the resulting English translations to the training data of the English-French MT system Difference in amount of domain-specific resources
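The last bullet on the slide above (creating artificial training data) can be sketched roughly as follows. This is only an illustration under stated assumptions: `translate_fr_to_en` stands in for whatever French-to-English MT system is available, and `toy_mt` with its one-entry dictionary is a hypothetical placeholder so that the sketch runs.

```python
from typing import Callable, List, Tuple


def make_synthetic_pairs(
    french_in_domain: List[str],
    translate_fr_to_en: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair machine-made English (source side) with genuine French (target side)."""
    pairs = []
    for fr in french_in_domain:
        en = translate_fr_to_en(fr)
        if en.strip():  # skip sentences for which the MT system returned nothing
            pairs.append((en, fr))
    return pairs


# Hypothetical toy "MT system", used only to make the sketch runnable.
_TOY_LEXICON = {"Ils habitaient là-bas": "They were living there"}


def toy_mt(sentence: str) -> str:
    return _TOY_LEXICON.get(sentence, "")


synthetic = make_synthetic_pairs(["Ils habitaient là-bas", "phrase inconnue"], toy_mt)
```

The resulting (English, French) pairs would be added to the existing parallel data before training the English-French system; the target side stays genuine in-domain French, only the source side is machine-made.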
  30. 30. Guide to this talk Machine translation: ● In general ● In a highly multilingual environment: eTranslation (EC) ● For EU cultural heritage Challenges: ●Domain imbalance ●Language imbalance ●Context demand ●Multimodal sources Approaches
  31. 31. eTranslation: MT system for 24 official EU languages + Icelandic and Norwegian (Bokmål) ● 130+ out of 552 language pairs, often from or into English ● Sometimes pivot: Finnish → English → Portuguese ● Management: DG Translation (technical), DG CNECT (EU’s MT policy) ● Users: translators of DG Translation, public administrations in the EEA ● Free use ● Confidentiality and security
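Pivoting, as mentioned on the slide above, amounts to chaining two systems when no direct one exists. A minimal sketch, assuming a hypothetical registry `systems` that maps (source, target) language codes to translation functions; it does not reflect how eTranslation is implemented internally.

```python
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]


def translate_with_pivot(
    text: str,
    src: str,
    tgt: str,
    systems: Dict[Tuple[str, str], Translator],
    pivot: str = "en",
) -> str:
    """Use a direct system if one exists, otherwise go through the pivot language."""
    if (src, tgt) in systems:
        return systems[(src, tgt)](text)
    to_pivot = systems[(src, pivot)]
    from_pivot = systems[(pivot, tgt)]
    return from_pivot(to_pivot(text))


# Hypothetical toy systems: one-word lexicons standing in for real MT engines.
toy_systems: Dict[Tuple[str, str], Translator] = {
    ("fi", "en"): lambda t: {"kirja": "book"}.get(t, t),
    ("en", "pt"): lambda t: {"book": "livro"}.get(t, t),
}

result = translate_with_pivot("kirja", "fi", "pt", toy_systems)
```

Errors made in the first step carry over into the second, which is one reason pivoting tends to be a fallback rather than the default.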
  32. 32. eTranslation ● User interface: snippets, documents ● API: online services, … ● Domain of training data: legal and administrative texts ● Specific MT systems for some organisations 🡪 E.g. Court of Justice (French ⇄ X) MT system for 24 official EU languages + Icelandic and Norwegian (Bokmål)
  33. 33. eTranslation: language imbalance ● Resource-rich language pairs (many parallel data), e.g. English-French ● Resource-poor language pairs, e.g. English-Irish, English-Icelandic 🡪 Lower MT quality Difference in amount of training data for language pairs
  34. 34. eTranslation: language imbalance Approach: build multilingual models ● Recent research topic in MT ● Translation from many languages into one, from one into many, etc. ● Language pairs that “learn” from each other how to translate (pieces of) words ● Surprising improvements for resource-poor language pairs Difference in amount of training data for language pairs
  35. 35. eTranslation: language imbalance Approach: build multilingual models (continued) ● Recent workshop in Luxembourg, organised by CrossLang for DG CNECT 🡪 Moderated by high-profile expert from Facebook ● Google AI group: attempts at creating “universal MT” (102 languages for now) ● Opportunity for scaling up MT Difference in amount of training data for language pairs
  36. 36. Guide to this talk Machine translation: ● In general ● In a highly multilingual environment: eTranslation (EC) ● For EU cultural heritage Challenges: ●Domain imbalance ●Language imbalance ●Context demand ●Multimodal sources Approaches
  37. 37. MT for culture ● Post-editing: e.g. static text on websites ● Gisting: e.g. dynamic text like visitors’ comments ● Cross-lingual retrieval: e.g. search for objects having metadata in another language Potential uses
  38. 38. MT for culture: context demand Metadata consisting of short text fragments Title: note, bank = “financial institution” / “location near river” ? = “comment” / “money” ?
  39. 39. MT for culture: context demand Metadata consisting of short text fragments Title: note, bank Subject: paper money = “comment” / ”money” ? 🡪 Dutch: biljet Approach: make use of the remainder of the metadata
  40. 40. MT for culture: context demand Metadata consisting of short text fragments Approach: make use of the remainder of the metadata 🡪 Approach is also useful for named entity recognition: Description: The Utrecht artist De Heem is regarded as one … Artist: Jan Davidsz de Heem
  41. 41. MT for culture: language imbalance Little or no parallel data involving “dead” / minority languages Approach for related languages: use available data + additional techniques ● Minority language + larger language ● Old + new language variant ● Advantage: similar vocabulary, spelling
  42. 42. MT for culture: language imbalance Little or no parallel data involving “dead” / minority languages Alternative approach for related languages: train an unsupervised MT system ● Uses monolingual corpora for the two languages ● Identifies similar words and sentences in both languages ● Learns to translate in both directions
  43. 43. MT for culture: multimodal sources Translation in case of non-textual objects (including non-digitised text) ● Audio material → speech recognition ● Scanned documents → OCR ● Photographs with text → OCR (?) ● Images without text → text describing image In all cases: imperfect MT input
  44. 44. MT for culture: multimodal sources Translation in case of non-textual objects (including non-digitised text) Approach: correct output using metadata before applying MT OCR: Demer en Capueienen Metadata: … Capucienen …
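The correction step on the slide above can be approximated with fuzzy string matching. The sketch below is one possible reading, not the speaker's method: it swaps an OCR token for a metadata term when the two are nearly identical, using `difflib` from the Python standard library. The function name and the 0.85 cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import Iterable


def correct_with_metadata(
    ocr_text: str,
    metadata_terms: Iterable[str],
    cutoff: float = 0.85,
) -> str:
    """Replace OCR tokens by a closely matching metadata term, if there is one."""
    terms = list(metadata_terms)
    corrected = []
    for token in ocr_text.split():
        match = difflib.get_close_matches(token, terms, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


fixed = correct_with_metadata("Demer en Capueienen", ["Capucienen"])
```

With the slide's example, "Capueienen" differs from "Capucienen" by one character and is replaced, while "Demer" and "en" are left untouched. A lower cutoff would correct more tokens at the risk of overwriting correct words.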
  45. 45. Conclusions ● MT for cultural heritage stretches across many dimensions Languages, domains, genres, registers, periods, … ● It is a particularly interesting and demanding area for MT Huge potential of multilingual object metadata, big challenges ● Approaches involve new information sources, refinement of tools and methods
  46. 46. Books on a table, Aalto, Ilmari, 1928, National Digital Library (NDL), Finland, CC0 Antoine Isaac R&D Manager Europeana Foundation Case study - Content translation and search
  47. 47. Aspects of multilingual experience - Content A focused view of our conceptual model of multilingual approach
  48. 48. First experiments - Translation of virtual exhibitions
  49. 49. Translation of virtual exhibitions Pilot: apply eTranslation to assist manual translation of exhibitions ● Exhibitions from two Generic Services projects: ○ Migration in the Arts and Sciences ○ Rise of Literacy ● 13 people from 11 institutions reviewed translations from English into 8 languages: ○ Dutch, French, Hungarian, Italian, Lithuanian, Polish, Portuguese, Slovenian NB: no German (for which eTranslation has a "cultural" version)
  50. 50. Translation of virtual exhibitions Pilot: apply eTranslation to assist manual translation of exhibitions ● The output is medium to good, but it does not handle the carefully crafted narrative text well, so partners spend a lot of time rewriting ● The quality is still too low to translate exhibitions sustainably and cost-effectively
  51. 51. Ongoing experiments - content translation and search New case study: using translation in search for text objects ● An important need for Europeana (cf. Newspapers, Transcriptions) ● One that may still work with less-than-perfect translations
  52. 52. The strategy for using translation in cross-lingual search Identify language Original query Translate to English Multilingual index User validation Search Translated query (English) Align to entity Entity-based query Multilingual query: entity based query + original query + translated query #1: French #2: Spanish #3: Polish Search results
  53. 53. Multilingual search for text objects A focused view on the general strategy Usage scenarios ● Input fulltext to multilingual search ● Enter search query in chosen language ● See search results ● Multilingual search would be extended with fulltext English Outcomes Caveat: no display/UX considerations at this stage!
  54. 54. Multilingual search for text objects Proposals - indexing: ● Automated identification of text object language if needed ● Static translation of text objects to English ● Index fulltext in both English and source language Proposals - search: ● Automated identification of language of entered query ● Dynamically translate search phrase to English ● Submit query comprising [original search phrase] + [English translation of search phrase]
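The search-side proposal above could look roughly like this. It is a sketch under assumptions: `detect_language` and `translate_to_english` are hypothetical callables standing in for a language identifier and an MT service, and the quoted `OR` syntax is just one plausible way to combine the two phrases, not the syntax Europeana uses.

```python
from typing import Callable


def build_multilingual_query(
    phrase: str,
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
) -> str:
    """Combine the original search phrase with its English translation."""
    lang = detect_language(phrase)
    if lang == "en":
        return f'"{phrase}"'  # nothing to translate
    english = translate_to_english(phrase, lang)
    if not english or english.lower() == phrase.lower():
        return f'"{phrase}"'  # translation missing or identical (e.g. a name)
    return f'"{phrase}" OR "{english}"'


# Hypothetical stubs so the sketch runs without external services.
query = build_multilingual_query(
    "domov",
    detect_language=lambda q: "cs",
    translate_to_english=lambda q, lang: "home",
)
```

Dropping the translated clause when it equals the original keeps named entities such as "Pinsk" from being searched twice.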
  55. 55. Multilingual search for text objects ● How successful is automated language detection? ● What is the projected cost of statically translating fulltext to English? ● Benchmarking of search engine results that compare native language keyword queries with English keyword queries Validation points
  56. 56. What we've done We have tested our cross-lingual search approach on transcriptions of World War I objects from Transcribathons hosted by the Enrich Europeana project. We have used the CEF eTranslation automatic translation service and have assessed the prototype with a sample of user queries from the Europeana 1914-1918 thematic collection.
  57. 57. Data acquisition and processing: text objects (transcriptions) Original corpus: ● 18,257 transcriptions ● 17 languages eTranslation failed in only 404 cases: ● Language not supported (Bosnian) ● Long text - can be fixed

      Language tag      Transcriptions   Translated to English
      de                9300             9151
      fr                1669             1659
      it                992              973
      ro                578              577
      nl                455              454
      el                364              356
      lv                226              226
      bs                215              0
      cs                90               90
      da                90               90
      sl                7                7
      hu                3                2
      es                2                2
      pl                2                2
      sk                2                2
      hr                1                1
      TOTAL (non-en)    13996            13592
      en                4243             0
      TOTAL             18239            13592

  58. 58. Data acquisition and processing: queries Original corpus: ● Sample from Google Analytics, first 10 months of 2019 ● 91 different queries ● 9 languages eTranslation worked in all cases

      Language tag      Queries   Translated to English
      it                29        29
      fr                14        14
      de                12        12
      pl                6         6
      es                3         3
      nl                2         2
      ro                2         2
      cs                1         1
      TOTAL (non-en)    69        69
      en                22        0
      TOTAL             91        69

  59. 59. Results Translation brings more results in!

      Original query    Language   Translated query   Results (original query)   Results (translated query)   New docs retrieved thanks to translation
      domov             cs         home               2                          1529                         1527
      Bernhard Stiens   de         Bernhard Stiens    16                         21                           8
      cimitero          de         ciemitero          0                          0                            0
      eastern front     de         Eastern front      345                        1272                         955
      lagazuoi          de         lapiönoi           0                          0                            0
      letters           de         letters            25                         1935                         1913
      nova vas          de         Nova vas           4                          31                           29
      Pinsk             de         Pinsk              1                          1                            0
      podgora           de         podgora            1                          7                            6
      Rokitno           de         Roitno             0                          0                            0
      san elia          de         San elia           40                         49                           16
      Talies            de         Talies             0                          2                            2
      women             de         women              4                          255                          251
      antonio sordi     it         Antonio Deaf       12                         25                           14
      Asiago            it         Asiago 1)          4                          2552                         2548
      avion             it         Avion              0                          4                            4
      bini cima         it         Bini top           3                          837                          835
      celle lager       it         lager cells        2                          56                           56
  60. 60. Example Kriegstagebuch von Peter Arabin contributed by Sigrid Arabin-Möhrer CC-BY-SA https://www.europeana.eu/portal/en/record/2020601/http s___1914_1918_europeana_eu_contributions_6461.html
  61. 61. Evaluation We didn't have time to do a fine-grained evaluation of the relevance of results, especially for accuracy. (The slide repeats the results table from slide 59.) What price are we ready to pay for such results?
  62. 62. Evaluation 1 - reproducing original results with translations For each language, we tested the overlap between results without translation and results with translation, for queries and docs in that language ● 67% of the original results are retrieved after translation. Extrapolation: we can expect that if we use translation we could discover 67% of the records in other languages that are more likely to be good. ● 49% of translation-based results are confirmed in the original language. Extrapolation: we would have to assume that 51% of the results are more likely to be noisy. This is interesting but we need more evaluation, especially since ● We could do it only for 5 languages (in others the original queries had 0 results). ● We cannot assess possible beneficial side effects of translation over the monolingual case, such as matching synonyms.
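The two percentages on the slide above are overlap ratios between two result sets. The sketch below shows one way to compute them for a single query; how the figures were aggregated over queries and languages is not stated on the slide, and the function name and document identifiers here are made up for illustration.

```python
from typing import Set, Tuple


def overlap_rates(original: Set[str], translated: Set[str]) -> Tuple[float, float]:
    """Return (share of original results retrieved after translation,
    share of translation-based results confirmed by the original search)."""
    shared = original & translated
    retrieved = len(shared) / len(original) if original else 0.0
    confirmed = len(shared) / len(translated) if translated else 0.0
    return retrieved, confirmed


# Hypothetical result sets for one query.
original_results = {"doc1", "doc2", "doc3"}
translated_results = {"doc2", "doc3", "doc4", "doc5"}

retrieved, confirmed = overlap_rates(original_results, translated_results)
```

Treating the original-language results as the reference, the first ratio plays the role of recall and the second the role of precision, which matches the slide's reading of the remaining 51% as potentially noisy.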
  63. 63. Evaluation 2 - evaluating query translations Assessing the quality of translations for the 69 non-English queries original query (WWI collection) language translated query good translation bad translation wrong language named entity, no transl. applicable named entity, transl. applicable [...] domov cs home 1 Bernhard Stiens de Bernhard Stiens 1 cimitero de ciemitero 1 1 eastern front de Eastern front 1 lagazuoi de lapiönoi 1 1 letters de letters 1 nova vas de Nova vas 1 Pinsk de Pinsk 1 podgora de podgora 1 Rokitno de Roitno 1 1 san elia de San elia 1 Talies de Talies 1 women de women 1 antonio sordi it Antonio Deaf 1 1 Asiago it Asiago 1) 1 1 avion it Avion bini cima it Bini top 1 1 celle lager it lager cells 1 1 1 cellelager it celager eastern front it Eastern front 1 fogliano it Fogliano 1 gaudioso matteo it Mr Matteo 1 1 gay flavio it Mr Gay Flavio 1 1 germania it Germany 1 1
  64. 64. Evaluation 2 - evaluating query translations Winnowing the original set ● In 22 cases the system was given wrong input, like typos or wrong language (einsenbahn in French?) ● In 4 cases we couldn't guess the user's intention (avion on the Italian portal) On the remaining 43 queries ● 37 queries were entities to be left unchanged, e.g., Bernhard Stiens (as opposed to Italia). eTranslation correctly handled 20 of them (54%). ● eTranslation correctly translated 5 of the 6 remaining cases (83%). Frankreich, Avion.- Soldatenfriedhof, Bundesarchiv, CC-BY-SA http://www.bild.bundesarchiv.de/archives/barchpic/search/_1268685391/ General observation: in our case, we're straight into the long tail of the queries
  65. 65. Future work ● Really evaluate the relevance of cross-lingual search results ● Scale up ● Extend to metadata ● Evaluate the impact of cross-lingual search on search performance ● Better handle named entities ● Better language identification ● Decide if query translation is really the way to go...
  66. 66. The Chinese Market, 1767 - 1769, Rijksmuseum, Netherlands, Public domain europeana.eu @EuropeanaEU
