Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2, 25 October 2019
1.
Books on a table, Aalto, Ilmari, 1928, National Digital Library (NDL), Finland, CC0
EUROPEANA MEETING
UNDER FINLAND’S PRESIDENCY
OF THE COUNCIL OF THE EU
ESPOO, FINLAND
25 October 2019
2.
Andy
Neale
Technical Director
Europeana Foundation
Recap on main conclusions of
Day 1
3.
Layers of digital CH system
● User Interface: show user’s preferred language
● Interactions: bridge the gap between language of user input and content
● Information Access: Search, Browse & Explore
● Content: metadata and digital CH objects
Juliane
4.
Mismatch between query and
content language
• Mona Lisa 203 results
• Monna Lisa 13 results
• La Gioconda 376 results
• La Joconde 78 results
Interactions
Roma, Galleria Corsini - La
Gioconda,
Juliane
5.
Challenges
• Missing training data for small languages
• Missing training data for (sub)domains
• Number of language pairs is immense with 50+
languages
• Metadata is too scarce for good translation results
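To make the scale of the language-pair challenge concrete: with n languages there are n × (n − 1) translation directions. A minimal sketch of that arithmetic (the function name is ours, for illustration only):

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


# 50 languages already give 2,450 translation directions.
print(directed_pairs(50))  # 2450
```

Counting unordered pairs instead would halve the figure, but MT systems are usually trained per direction.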
Juliane
6.
Evaluate solution based on goal
○ E.g. for multilingual (ML) retrieval we might not need a perfectly fluent
translation
○ Identify the impact of different workflows / processes on
multilinguality of system
○ Translations do not only have an impact on data but also on
retrieval and therefore on user satisfaction
Juliane
7.
Challenges for LT in cultural heritage
● Interface or content (= multilingual in a broad sense)
● Far beyond modern standard language use
● Great variation makes domain adaptation hard
● Variation in place (dialects and languages), time (old Swedish) and
situation (informal-formal)
● Modal variation in collections: (handwritten) text, speech, pictures
● Hard to handle as researchers want to explore a collection as a whole
Rickard
8.
Next steps
● Linked data to describe the collection conceptually and relationally
● Multilingual search methods for handling language variation in place,
time and situation
● Domain-adapted speech-to-text conversion to transcribe recordings
● Crowdsourcing for correcting
● Shared resources for the languages, dialects, domains etc.
● Long-term funding for the National Language Bank
● Collaborative projects involving LTists, researchers and data holders
Rickard
9.
Hugo.lv – AI powered language technology portal
Andrejs & Jānis
10.
Conclusions
• New generation of Neural MT strongly improves quality and applicability of
machine translation, especially for morphology rich languages
• Domain specific data is crucial for making MT suitable for cultural and other
domains
• Depending on the application, translation needs can be served by selecting
the most efficient approach – pure MT, human review of the MT, or fully
human translation
• We will be happy to share our experience, technologies and tools :)
Andrejs & Jānis
11.
Phases over time: Initiation (of a new service) 🡪 Development 🡪 Implementation 🡪 Operation and maintenance
● Process-time: Who is involved in the development and implementation of your service? What kinds of benefits can be identified?
● Use-time: Who uses your service? Are there other stakeholders? What kinds of benefits can be identified?
● Future: Who could (re)use your service or materials in the (undefined) future? What kinds of benefits can be anticipated?
Model for temporal division of benefits
Kautonen, H. & Nieminen, M. (2018): Conceptualizing Benefits of User-Centered Design for Digital Library
Services. Liber Quarterly, 28(1), pp. 1–34. DOI: http://doi.org/10.18352/lq.10231.
Heli
15.
Language detection and display (for validation)
Query translated into 24 languages
Dasha
16.
THE NATIONAL LIBRARY OF FINLAND
Thesaurus to ontology
▪ Reconstruction of YSA into machine-readable and multilingual YSO
▪ Trilingual terms for concepts (fin, swe, eng)
▪ YSA and Allärs merged together and translated into English
▪ Concepts are a compromise between Finnish and Swedish as YSA
and Allärs are not completely identical
▪ Links to Library of Congress Subject Headings (LCSH)
▪ Linking to Wikidata underway
▪ YSO just made the list of Europeana dereferenceable vocabularies
that can be enriched in the Europeana portal
Matias
17.
THE NATIONAL LIBRARY OF FINLAND
Annotate in one language, find using another
Matias
18.
THE NATIONAL LIBRARY OF FINLAND
Automated Subject Indexing made easy:
Annif
▪ An open source multilingual automated subject indexing
system using machine learning and our own vocabularies
Matias
19.
Europeana’s Knowledge Graph
Entity
Collection
Hugo
20.
Proposals for indexing and storing translations
● Automated identification of language if needed (only 26.5% of the data
provider’s metadata is language qualified)
● Use translations from multilingual knowledge graph
● Augment the provider metadata with static translation of the fields to English
(to fill metadata values not covered by the knowledge graph)
● Store and index translated metadata for search and display (original metadata
+ languages of the knowledge graph + English)
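The four proposals above can be read as one indexing routine. The sketch below is only an illustration under stated assumptions: the field naming (`title.<lang>`), the tiny `KG_LABELS` table and the two callables standing in for language detection and machine translation are all hypothetical, not Europeana's actual schema or services.

```python
from typing import Callable, Dict, Optional

# Hypothetical multilingual labels for one knowledge-graph entity.
KG_LABELS: Dict[str, Dict[str, str]] = {
    "entity:mona_lisa": {"en": "Mona Lisa", "it": "La Gioconda", "fr": "La Joconde"},
}


def build_index_doc(
    title: str,
    lang: Optional[str],
    entity_id: Optional[str],
    detect_language: Callable[[str], str],
    translate_to_english: Callable[[str, str], str],
) -> Dict[str, str]:
    """Combine original metadata, knowledge-graph labels and an English translation."""
    # 1. Identify the language only if the provider did not qualify it.
    if lang is None:
        lang = detect_language(title)
    doc = {f"title.{lang}": title}
    # 2. Reuse translations from the knowledge graph when an entity is linked.
    if entity_id in KG_LABELS:
        for label_lang, label in KG_LABELS[entity_id].items():
            doc.setdefault(f"title.{label_lang}", label)
    # 3. Fill the remaining gap with a static translation into English.
    if "title.en" not in doc:
        doc["title.en"] = translate_to_english(title, lang)
    return doc


doc = build_index_doc(
    "La Gioconda", None, "entity:mona_lisa",
    detect_language=lambda text: "it",            # stand-in detector
    translate_to_english=lambda text, src: text,  # stand-in MT call
)
print(doc)
```

The original value is never overwritten: knowledge-graph labels and the English translation are stored alongside it, matching the "original metadata + languages of the knowledge graph + English" proposal.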
Hugo
21.
Proposals for search on object metadata
Original query 🡪 Identify language 🡪 Translate to English 🡪 Translated query (English)
Translated query (English) 🡪 Suggest Entity (Knowledge Graph) 🡪 User disambiguates 🡪 Entity-based query
Multilingual query: entity-based query OR original query + translated query
Multilingual query 🡪 Search 🡪 Multilingual index
Results: #1: French, #2: Spanish, #3: Polish
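A minimal sketch of the last step of this flow, assembling the multilingual query. The query syntax (`entity:"…"`, `OR`) and the function name are assumptions made for illustration; language identification, translation and entity suggestion are taken as already done.

```python
from typing import Optional


def build_multilingual_query(original: str, translated: str,
                             entity_id: Optional[str] = None) -> str:
    """Return an entity-based query if an entity was chosen, else original OR translation."""
    if entity_id is not None:
        # The user picked one of the suggested knowledge-graph entities.
        return f'entity:"{entity_id}"'
    terms = [original]
    if translated.casefold() != original.casefold():
        terms.append(translated)  # avoid duplicating names left untranslated
    return " OR ".join(f'"{term}"' for term in terms)


print(build_multilingual_query("domov", "home"))
print(build_multilingual_query("La Joconde", "Mona Lisa", entity_id="mona_lisa"))
```

When the user does not disambiguate, the query falls back to the original phrase plus its English translation.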
Hugo
22.
Session 4
CONTENT TRANSLATION
Europa [Material cartográfico] : Nach den vorzüglichsten Hülfsmitteln, Götze, Johann August Ferdinand, 1773-1819, Biblioteca Digital de Madrid, Spain, Public domain
23.
Tom
Vanallemeersch
Machine translation specialist
CrossLang
The art of automating translation
24.
Cultural heritage and translation
● Translation helps to open up cultures
● Rosetta stone was the key to understanding hieroglyphs
Parallel data
● Systems for automated translation (now) act in a similar way
● However, the right stones are required, and many of them ...
25.
Context of this talk
EC project SMART 2016/0103:
● Identification of language technology needs of Digital Service Infrastructures of EC
E.g. Europeana DSI
● Framework: Connecting Europe Facility – Automated Translation
● Contracting authority: DG CNECT (the EC's multilingual enabler)
● Consortium:
26.
Guide to this talk
Machine translation (MT):
● In general
● In a highly multilingual environment: eTranslation (EC)
● For EU cultural heritage
Challenges:
● Domain imbalance
● Language imbalance
● Context demand
● Multimodal sources
Approaches
27.
MT in general
● MT systems are data-driven
🡪 Sentence pairs: They were living there - Ils habitaient là-bas
🡪 Software consisting of a neural network (like many recent AI applications)
● MT is used for various purposes
🡪 Post-editing, gisting, cross-lingual retrieval
28.
MT in general: domain imbalance
● Quality typically improves when increasing training data
● But there are few (accessible) translations in some domains
● The same problem occurs for specific genres (e.g. novels)
and registers (e.g. informal language)
Difference in amount of domain-specific resources
29.
MT in general: domain imbalance
Approach: identify/create domain-specific data
● Select sentence pairs from the vast ParaCrawl Corpus
● Use the ParaCrawl toolkit for multilingual websites, archives
● Select domain-specific parallel corpora from the ELRC-SHARE repository
● Create artificial training data: e.g. apply MT to French in-domain data,
add the English translations to English-French MT system
Difference in amount of domain-specific resources
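The last bullet (creating artificial training data) can be sketched as follows. This is an illustration only: `fr_to_en` stands for any existing French-to-English MT system and is a hypothetical callable, not a specific toolkit API.

```python
from typing import Callable, Iterable, List, Tuple


def make_synthetic_pairs(
    french_in_domain: Iterable[str],
    fr_to_en: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair machine-translated English with the authentic French sentences."""
    pairs = []
    for fr_sentence in french_in_domain:
        en_sentence = fr_to_en(fr_sentence)  # machine output, may be imperfect
        if en_sentence.strip():
            # (source, target) for training an English-French system:
            # the target side stays authentic in-domain French.
            pairs.append((en_sentence, fr_sentence))
    return pairs


toy_mt = {"Ils habitaient là-bas": "They were living there"}
print(make_synthetic_pairs(["Ils habitaient là-bas"], lambda s: toy_mt.get(s, "")))
```

The design point is that only the source side is synthetic, so the English-French system still learns to produce genuine in-domain French.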
31.
eTranslation
● 130+ out of 552 language pairs, often from or into English
● Sometimes pivot: Finnish 🡪 English 🡪 Portuguese
● Management: DG Translation (technical), DG CNECT (EU’s MT policy)
● Users: translators of DG Translation, public administrations in the EEA
● Free use
● Confidentiality and security
MT system for 24 official EU languages + Icelandic and Norwegian (Bokmål)
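Pivot translation simply chains two systems through a shared language. A minimal sketch, assuming two hypothetical translator callables (toy dictionaries here, not the eTranslation API):

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot(source_to_pivot: Translator, pivot_to_target: Translator) -> Translator:
    """Chain two MT systems through a pivot language (here: English)."""
    def translate(text: str) -> str:
        return pivot_to_target(source_to_pivot(text))
    return translate


# Toy stand-ins for Finnish-English and English-Portuguese systems.
fi_to_en = {"kirja": "book"}.__getitem__
en_to_pt = {"book": "livro"}.__getitem__

fi_to_pt = pivot(fi_to_en, en_to_pt)
print(fi_to_pt("kirja"))  # livro
```

Errors made in the first step are passed on to the second, which is one reason direct language pairs are preferred when enough training data exists.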
32.
eTranslation
● User interface: snippets, documents
● API: online services, …
● Domain of training data: legal and administrative texts
● Specific MT systems for some organisations
🡪 E.g. Court of Justice (French ⇄ X)
MT system for 24 official EU languages + Icelandic and Norwegian (Bokmål)
33.
eTranslation: language imbalance
● Resource-rich language pairs (many parallel data), e.g. English-French
● Resource-poor language pairs, e.g. English-Irish, English-Icelandic
🡪 Lower MT quality
Difference in amount of training data for language pairs
34.
eTranslation: language imbalance
Approach: build multilingual models
● Recent research topic in MT
● Translation from many languages into one, from one into many, etc.
● Language pairs that “learn” from each other how to translate (pieces of) words
● Surprising improvements for resource-poor language pairs
Difference in amount of training data for language pairs
35.
eTranslation: language imbalance
Approach: build multilingual models (continued)
● Recent workshop in Luxembourg, organised by CrossLang for DG CNECT
🡪 Moderated by high-profile expert from Facebook
● Google AI group: attempts at creating “universal MT” (102 languages for now)
● Opportunity for scaling up MT
Difference in amount of training data for language pairs
37.
MT for culture
● Post-editing: e.g. static text on websites
● Gisting: e.g. dynamic text like visitors’ comments
● Cross-lingual retrieval: e.g. search for objects having metadata in another language
Potential uses
38.
MT for culture: context demand
Metadata consisting of short text fragments
Title: note, bank
note = “comment” / “money” ?
bank = “financial institution” / “location near river” ?
39.
MT for culture: context demand
Metadata consisting of short text fragments
Title: note, bank
Subject: paper money
note = “comment” / “money” ? 🡪 Dutch: biljet
Approach: make use of the remainder of the metadata
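A toy illustration of using the remainder of the metadata as context: pick the sense of an ambiguous title word whose cue words overlap most with the other fields. The sense inventory, the cue words and the Dutch rendering "notitie" are assumptions made for this example; a real MT system would learn such preferences from data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Sense:
    gloss: str
    cues: FrozenSet[str]
    dutch: str


# Hypothetical sense inventory for the ambiguous title word "note".
SENSES: Dict[str, List[Sense]] = {
    "note": [
        Sense("comment", frozenset({"remark", "annotation", "letter"}), "notitie"),
        Sense("money", frozenset({"paper", "money", "currency"}), "biljet"),
    ],
}


def pick_sense(word: str, other_fields: List[str]) -> Sense:
    """Choose the sense sharing the most cue words with the other metadata fields."""
    context = {token.lower() for field in other_fields for token in field.split()}
    return max(SENSES[word], key=lambda sense: len(sense.cues & context))


print(pick_sense("note", ["paper money"]).dutch)  # biljet
```

With the Subject field "paper money" as context, the "money" sense wins and the Dutch translation becomes "biljet".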
40.
MT for culture: context demand
Metadata consisting of short text fragments
Approach: make use of the remainder of the metadata
🡪 Approach is also useful for named entity recognition:
Description: The Utrecht artist De Heem is regarded as one …
Artist: Jan Davidsz de Heem
41.
MT for culture: language imbalance
Little or no parallel data involving “dead” / minority languages
Approach for related languages: use available data + additional techniques
● Minority language + larger language
● Old + new language variant
● Advantage: similar vocabulary, spelling
42.
MT for culture: language imbalance
Little or no parallel data involving “dead” / minority languages
Alternative approach for related languages: train an unsupervised MT system
● Uses monolingual corpora for the two languages
● Identifies similar words and sentences in both languages
● Learns to translate in both directions
43.
MT for culture: multimodal sources
Translation in case of non-textual objects (including non-digitised text)
● Audio material 🡪 Speech recognition
● Scanned documents 🡪 OCR
● Photographs with text 🡪 OCR (?)
● Images without text 🡪 Text describing image
🡪 Imperfect MT input
44.
MT for culture: multimodal sources
Translation in case of non-textual objects (including non-digitised text)
Approach: correct output using metadata before applying MT
OCR: Demer en Capueienen
Metadata: … Capucienen …
45.
Conclusions
● MT for cultural heritage stretches across many dimensions
Languages, domains, genres, registers, periods, …
● It is a particularly interesting and demanding area for MT
Huge potential of multilingual object metadata, big challenges
● Approaches involve new information sources, refinement of tools and methods
46.
Antoine
Isaac
R&D Manager
Europeana Foundation
Case study -
Content translation and search
47.
Aspects of multilingual experience
- Content
A focused view of our
conceptual model of
multilingual approach
48.
First experiments -
Translation of virtual exhibitions
49.
Translation of virtual exhibitions
Pilot: apply eTranslation to assist
manual translation of exhibitions
● Exhibitions from two Generic Services projects:
○ Migration in the Arts and Sciences
○ Rise of Literacy
● 13 people from 11 institutions reviewed
translations from English into 8 languages:
○ Dutch, French, Hungarian, Italian, Lithuanian,
Polish, Portuguese, Slovenian
NB: no German (for which eTranslation has a
"cultural" version)
50.
Translation of virtual exhibitions
Pilot: apply eTranslation to assist
manual translation of exhibitions
● The output is medium to good but does not
translate the carefully crafted narrative text well,
leading to partners spending a lot of time
rewriting
● The quality is still too low to translate exhibitions
sustainably and cost-effectively
51.
Ongoing experiments - content
translation and search
New case study: using translation
in search for text objects
● An important need for
Europeana (cf.
Newspapers,
Transcriptions)
● One that may still work
with less-than-perfect
translations
52.
The strategy for using translation
in cross-lingual search
Original query 🡪 Identify language 🡪 Translate to English 🡪 Translated query (English)
Translated query (English) 🡪 Align to entity 🡪 User validation 🡪 Entity-based query
Multilingual query: entity-based query + original query + translated query
Multilingual query 🡪 Search 🡪 Multilingual index
Search results: #1: French, #2: Spanish, #3: Polish
53.
Multilingual search for text objects
A focused view on the general strategy
Usage scenarios
● Input fulltext to multilingual search
● Enter search query in chosen language
● See search results
Outcomes
● Multilingual search would be extended with fulltext English
Caveat: no display/UX considerations at this stage!
54.
Multilingual search for text objects
Proposals - indexing
● Automated identification of text object language if needed
● Static translation of text objects to English
● Index fulltext in both English and source language
Proposals - search
● Automated identification of language of entered query
● Dynamically translate search phrase to English
● Submit query comprising [original search phrase] + [English translation of search phrase]
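The indexing and search proposals on this slide can be sketched end to end with an in-memory index. Everything here is a simplified stand-in: `to_english` is a hypothetical translation callable (a toy lookup table in the example), language identification is assumed to have happened already, and substring matching replaces a real search engine.

```python
from typing import Callable, Dict, List

ToEnglish = Callable[[str, str], str]  # (text, source language) -> English text


def index_text_object(doc_id: str, text: str, lang: str,
                      to_english: ToEnglish) -> Dict[str, str]:
    """Store the fulltext both in its source language and in English."""
    english = text if lang == "en" else to_english(text, lang)
    return {"id": doc_id, "lang": lang, "fulltext": text, "fulltext_en": english}


def search(index: List[Dict[str, str]], phrase: str, phrase_lang: str,
           to_english: ToEnglish) -> List[str]:
    """Query = [original search phrase] + [English translation of search phrase]."""
    english_phrase = phrase if phrase_lang == "en" else to_english(phrase, phrase_lang)
    hits = []
    for doc in index:
        if (phrase.lower() in doc["fulltext"].lower()
                or english_phrase.lower() in doc["fulltext_en"].lower()):
            hits.append(doc["id"])
    return hits


# Toy translation table standing in for an MT service.
TOY_MT = {
    ("Brief nach Hause", "de"): "letter home",
    ("Kde domov můj", "cs"): "Where is my home",
    ("domov", "cs"): "home",
}


def toy_to_english(text: str, lang: str) -> str:
    return TOY_MT[(text, lang)]


index = [
    index_text_object("doc-de", "Brief nach Hause", "de", toy_to_english),
    index_text_object("doc-cs", "Kde domov můj", "cs", toy_to_english),
]
print(search(index, "domov", "cs", toy_to_english))  # ['doc-de', 'doc-cs']
```

The Czech query finds the Czech document through the original phrase and the German document only through the English side of the index, which is the cross-lingual effect the proposals aim for.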
55.
Multilingual search for text objects
Validation points
● How successful is automated language detection?
● What is the projected cost of statically translating fulltext to English?
● Benchmarking of search engine results that compare native language keyword queries with English keyword queries
56.
What we've done
We have tested our cross-lingual search approach on transcriptions
of World War I objects from Transcribathons hosted by the Enrich
Europeana project. We have used the CEF eTranslation automatic
translation service and have assessed the prototype with a sample
of user queries from the Europeana 1914-1918 thematic collection.
57.
Data acquisition and processing
Text objects (transcriptions)
Original corpus:
● 18,257 transcriptions
● 17 languages
eTranslation failed in only 404 cases:
● Language not supported (Bosnian)
● Long text - can be fixed
Language tag    Transcriptions  Translated to English
de              9300            9151
fr              1669            1659
it              992             973
ro              578             577
nl              455             454
el              364             356
lv              226             226
bs              215             0
cs              90              90
da              90              90
sl              7               7
hu              3               2
es              2               2
pl              2               2
sk              2               2
hr              1               1
TOTAL (non-en)  13996           13592
en              4243            0
TOTAL           18239           13592
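As a quick check of the figures in this table, the totals and the translation coverage can be recomputed from the per-language rows. The numbers are copied from the slide; the variable names are ours.

```python
# (language tag, transcriptions, translated to English), copied from the slide
ROWS = [
    ("de", 9300, 9151), ("fr", 1669, 1659), ("it", 992, 973), ("ro", 578, 577),
    ("nl", 455, 454), ("el", 364, 356), ("lv", 226, 226), ("bs", 215, 0),
    ("cs", 90, 90), ("da", 90, 90), ("sl", 7, 7), ("hu", 3, 2),
    ("es", 2, 2), ("pl", 2, 2), ("sk", 2, 2), ("hr", 1, 1),
]

total = sum(count for _, count, _ in ROWS)
translated = sum(done for _, _, done in ROWS)
failed = total - translated
coverage = translated / total

print(total, translated, failed)  # 13996 13592 404
print(f"{coverage:.1%}")          # 97.1%
```

The 404 failures match the figure quoted on the slide, and more than half of them (215) are the unsupported Bosnian transcriptions.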
58.
Data acquisition and processing
Queries
Original corpus:
● Sample from Google Analytics, first 10 months of 2019
● 91 different queries
● 9 languages
eTranslation worked in all cases
Language tag    Queries  Translated to English
it              29       29
fr              14       14
de              12       12
pl              6        6
es              3        3
nl              2        2
ro              2        2
cs              1        1
TOTAL (non-en)  69       69
en              22       0
TOTAL           91       69
59.
Results
Translation brings more results in!
original query   language  translated query  results original query  results translated query  new docs retrieved thanks to translation
domov            cs        home              2                       1529                      1527
Bernhard Stiens  de        Bernhard Stiens   16                      21                        8
cimitero         de        ciemitero         0                       0                         0
eastern front    de        Eastern front     345                     1272                      955
lagazuoi         de        lapiönoi          0                       0                         0
letters          de        letters           25                      1935                      1913
nova vas         de        Nova vas          4                       31                        29
Pinsk            de        Pinsk             1                       1                         0
podgora          de        podgora           1                       7                         6
Rokitno          de        Roitno            0                       0                         0
san elia         de        San elia          40                      49                        16
Talies           de        Talies            0                       2                         2
women            de        women             4                       255                       251
antonio sordi    it        Antonio Deaf      12                      25                        14
Asiago           it        Asiago 1)         4                       2552                      2548
avion            it        Avion             0                       4                         4
bini cima        it        Bini top          3                       837                       835
celle lager      it        lager cells       2                       56                        56
60.
Example
Kriegstagebuch von Peter Arabin
contributed by Sigrid Arabin-Möhrer
CC-BY-SA
https://www.europeana.eu/portal/en/record/2020601/http
s___1914_1918_europeana_eu_contributions_6461.html
61.
Evaluation
We didn't have time to do a
fine-grained evaluation of the
relevance of results, especially for
accuracy
What price are we ready to pay for such results?
62.
Evaluation 1 - reproducing original results with translations
For each language, we tested the overlap between results without translation and results with translation, for queries and docs in that language.
● 67% of the original results are retrieved after translation.
Extrapolation: if we use translation, we can expect to discover 67% of the records in other languages that are more likely to be good.
● 49% of translation-based results are confirmed in the original language.
Extrapolation: we would have to assume that 51% of the results are more likely to be noisy.
This is interesting but we need more evaluation, especially since
● We could do it only for 5 languages (in the others the original queries had 0 results).
● We cannot assess possible beneficial side effects of translation over the monolingual case, such as matching synonyms.
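The two percentages are overlap ratios between two result sets for the same query: one ratio over the original results, one over the translation-based results. A minimal sketch of how such figures can be computed; the function name and the document ids in the example are made up, and the slide's 67% and 49% are aggregates that this toy does not reproduce.

```python
from typing import Optional, Set, Tuple


def overlap_metrics(original_hits: Set[str],
                    translated_hits: Set[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (share of original results recovered, share of translated results confirmed)."""
    shared = original_hits & translated_hits
    recovered = len(shared) / len(original_hits) if original_hits else None
    confirmed = len(shared) / len(translated_hits) if translated_hits else None
    return recovered, confirmed


# Made-up document ids for one query.
recovered, confirmed = overlap_metrics({"a", "b", "c"}, {"b", "c", "d", "e"})
print(recovered, confirmed)
```

Queries whose original search returned nothing yield `None` for the first ratio, which mirrors the limitation noted above that only 5 languages could be evaluated.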
63.
Evaluation 2 - evaluating query translations
Assessing the quality of translations for the 69 non-English queries
Columns: original query (WWI collection); language; translated query; good translation; bad translation; wrong language; named entity, no transl. applicable; named entity, transl. applicable; [...]
domov cs home 1
Bernhard Stiens de Bernhard Stiens 1
cimitero de ciemitero 1 1
eastern front de Eastern front 1
lagazuoi de lapiönoi 1 1
letters de letters 1
nova vas de Nova vas 1
Pinsk de Pinsk 1
podgora de podgora 1
Rokitno de Roitno 1 1
san elia de San elia 1
Talies de Talies 1
women de women 1
antonio sordi it Antonio Deaf 1 1
Asiago it Asiago 1) 1 1
avion it Avion
bini cima it Bini top 1 1
celle lager it lager cells 1 1 1
cellelager it celager
eastern front it Eastern front 1
fogliano it Fogliano 1
gaudioso matteo it Mr Matteo 1 1
gay flavio it Mr Gay Flavio 1 1
germania it Germany 1 1
64.
Evaluation 2 - evaluating query translations
Winnowing the original set
● In 22 cases the system was given wrong input, like
typos or wrong language (einsenbahn in French?)
● In 4 cases we couldn't guess the user's intention
(avion on the Italian portal)
On the remaining 43 queries
● 37 queries were entities to be left unchanged, e.g.,
Bernhard Stiens (as opposed to Italia).
eTranslation correctly handled 20 of them (54%).
● eTranslation correctly translated 5 of the 6
remaining cases (83%).
Frankreich, Avion.- Soldatenfriedhof, Bundesarchiv, CC-BY-SA
http://www.bild.bundesarchiv.de/archives/barchpic/search/_1268685391/
General observation: in our case, we're straight into the long tail of the queries
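The counts on this slide fit together as follows. The sketch only re-derives the slide's own figures; the combined success rate at the end is our derived number, not one reported in the talk.

```python
non_english_queries = 69
wrong_input = 22      # typos or wrong language
unclear_intent = 4    # user's intention could not be guessed

remaining = non_english_queries - wrong_input - unclear_intent
entities, entities_ok = 37, 20          # named entities to be left unchanged
others = remaining - entities           # queries that needed real translation
others_ok = 5

print(remaining, others)                               # 43 6
print(round(100 * entities_ok / entities))             # 54
print(round(100 * others_ok / others))                 # 83
# Derived, not on the slide: overall share handled correctly.
print(round(100 * (entities_ok + others_ok) / remaining))  # 58
```

Named entities dominate the remaining queries (37 of 43), so how they are handled largely determines the overall result.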
65.
Future work
● Really evaluate the relevance of cross-lingual search results
● Scale up
● Extend to metadata
● Evaluate the impact of cross-lingual search on search performance
● Better handle named entities
● Better language identification
● Decide if query translation is really the way to go...
66.
The Chinese Market, 1767 - 1769, Rijksmuseum, Netherlands, Public domain
europeana.eu
@EuropeanaEU