A proposal developed jointly by the FALCON (www.falcon-project.eu) and LIDER (www.lider-project.eu) projects for Open Data Management for Public Automated Translation Services. It was offered as input to the MLi project, which is capturing procurement requirements for a future Automated Translation service under the EU’s Connecting Europe Facility (CEF.AT).
Open Data Management for Public Automated Translation
1. Open Data Management for Public Automated Translation
Dave Lewis
CNGL at Trinity College Dublin
2. The Localization Web
• Open Data on the Web: W3C Semantic Web standards allow data to be published on the Web
– Fine-grained URI-based inter-linking
– Extensible meta-data
– Standard Query APIs
• Enables a Localization Web
– Terms and translations become linkable resources
– Meta-data from L10n workflows adds value
– Leverage in training Machine Translation and Text Analytics
The Localization Web = Decentralised Annotated Global Translation Memory and Term Base
3. Linguistic Linked Data
[Figure: linked-data model of the Spanish lexical entry "red". Node labels: Form, number singular, [RED]; Form, number plural, [REDES]; Phonetic form; Form, gender feminine; Sense, written form “red”; Sense, written form “malla”, linked as equivalent; image; translation es - en between senses with written forms “red” and “network”.]
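The lexical entry in the figure can be sketched as a small data structure. This is only an illustration: the class and field names (`Form`, `Sense`, `LexicalEntry`, `translate`) are hypothetical and are not taken from any published vocabulary; the values are the labels visible in the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Form:
    label: str   # form label as shown in the figure, e.g. "RED"
    number: str  # "singular" or "plural"


@dataclass
class Sense:
    written_form: str
    # Same-language equivalents, e.g. "malla" for "red"
    equivalents: List[str] = field(default_factory=list)
    # Target language code -> written forms in that language
    translations: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class LexicalEntry:
    lemma: str
    lang: str
    gender: str
    forms: List[Form]
    senses: List[Sense]


def translate(entry: LexicalEntry, target_lang: str) -> List[str]:
    """Collect the written forms in target_lang across all senses."""
    found: List[str] = []
    for sense in entry.senses:
        found.extend(sense.translations.get(target_lang, []))
    return found


# Hypothetical instance built from the labels in the figure.
red = LexicalEntry(
    lemma="red",
    lang="es",
    gender="feminine",
    forms=[Form("RED", "singular"), Form("REDES", "plural")],
    senses=[Sense("red", equivalents=["malla"], translations={"en": ["network"]})],
)

print(translate(red, "en"))
```

In a linked-data setting each of these objects would be a URI-identified resource rather than an in-memory object, so that third parties could link to an individual form or sense.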
5. Words as Resources on the Web
[Figure: the Web of Content linked to the Localization Web. Sample texts are reproduced as they appear on the slide.]
The Web of Content
– http://www.ex.org/obama_en.html: “Barak Obama if the 44th president of the United State of America. He was first elected in 2009.”
– http://www.ex.org/obama_es.html: “Barak Obama si el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2009.”
The Localization Web
Translation Data
– http://data.ex.org/String_0001 (Derived From the English page): Text: “Barak Obama if the 44th president of the United State of America.” Lang: en
– http://data.ex.org/String_0002 (Derived From the Spanish page; Translated From String_0001): Text: “Barak Obama si el 44 º presidente de los Estados Unidos de América.” Lang: es; TranslatedBy: Google Translate
Terminology Data
– Term: “United State of America.” Lang: en
– Term: “Estados Unidos de América.” Lang: es
– Terms linked by Translation Of; resources http://babelnet.org/345621 and http://babelnet.org/57835
Encyclopaedic Data
– http://Dbpedia.org/Page/Barak_Obama: Topic: Barak Obama; Lang: en; BirthDate: 1961-08-04; Spouse: Michelle Obama; Residence: White House
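One way to picture the translation data on this slide is as a small JSON-LD document. The sketch below is an assumption-laden illustration, not a proposed format: the `http://example.org/l10n#` vocabulary and its property names are hypothetical, the pairing of String_0001 with the English text and String_0002 with the Spanish text is inferred from the slide layout, and the sample strings are copied verbatim from the slide.

```python
import json

# Hypothetical vocabulary namespace; not a published ontology.
L10N = "http://example.org/l10n#"

context = {
    "text": L10N + "text",
    "lang": L10N + "lang",
    "translatedBy": L10N + "translatedBy",
    # These two properties point at other resources, so their values are IRIs.
    "derivedFrom": {"@id": L10N + "derivedFrom", "@type": "@id"},
    "translatedFrom": {"@id": L10N + "translatedFrom", "@type": "@id"},
}

source = {
    "@id": "http://data.ex.org/String_0001",
    "text": "Barak Obama if the 44th president of the United State of America.",
    "lang": "en",
    "derivedFrom": "http://www.ex.org/obama_en.html",
}

target = {
    "@id": "http://data.ex.org/String_0002",
    "text": "Barak Obama si el 44 º presidente de los Estados Unidos de América.",
    "lang": "es",
    "derivedFrom": "http://www.ex.org/obama_es.html",
    "translatedFrom": source["@id"],
    "translatedBy": "Google Translate",
}

doc = {"@context": context, "@graph": [source, target]}


def follow(graph, node_id, prop):
    """Follow a link property from one node to the node it points at.

    Returns None if the property is missing or the target is not in the graph.
    """
    by_id = {node["@id"]: node for node in graph}
    return by_id.get(by_id[node_id].get(prop))


print(json.dumps(doc, indent=2, ensure_ascii=False))
```

Because every string has its own URI, a correction or quality annotation can be attached to String_0002 without touching the page it was derived from.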
6. Data Management Lifecycles
[Figure: three interlinked lifecycles: Lex-concept lifecycle, Bitext lifecycle, Content lifecycle. Step labels in the diagram: Publish; Discover & use; Correct & refine; Discover data; (Re)train MT; Revise and annotate; Create; I18n & source QA; Automated translation; Post-edit; Trans QA; Consume.]
7. Public Automated Translation: Data Management Needs?
• Assert ownership and attribution, licensing, access control?
• Persistent URLs?
• Open royalty-free standards?
• Indexing is key: federated vs aggregated data?
• Third-party submission of errors, QA, corrections? Publish actions taken on submissions?
8. Public Automated Translation: Data Access Needs?
• Query bitext on:
– languages, terms, MT engine, MT confidence, QA, translator qualifications, post-edit characteristics?
• Query lexical-conceptual resources on:
– language, domains, context, semantic relations, provenance of lexical/conceptual data?
• Localisation Web API:
– HTTP content negotiation (Unicode extensions for translation?),
– Resource Format: RDF, TMX, TBX, JSON-LD,
– SPARQL endpoints?
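The access needs above can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the endpoint http://data.ex.org/sparql, the `l10n:` vocabulary and the `mtConfidence` property are hypothetical, and the requests are only constructed, never sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import Request


def negotiate(resource_url, media_type="application/ld+json", lang="es"):
    """Build a GET request that asks for one representation of a resource
    via HTTP content negotiation (Accept and Accept-Language headers)."""
    return Request(
        resource_url,
        headers={"Accept": media_type, "Accept-Language": lang},
    )


def bitext_query(source_lang, target_lang, min_confidence):
    """Build a SPARQL query for bitext pairs filtered by language and
    MT confidence. Inputs are interpolated without escaping, so this
    sketch must only be used with trusted values."""
    return f"""PREFIX l10n: <http://example.org/l10n#>
SELECT ?source ?target WHERE {{
  ?target l10n:translatedFrom ?source ;
          l10n:lang "{target_lang}" ;
          l10n:mtConfidence ?conf .
  ?source l10n:lang "{source_lang}" .
  FILTER(?conf >= {min_confidence})
}}"""


def sparql_get_url(endpoint, query):
    """Encode a query as a SPARQL Protocol GET URL (?query=...)."""
    return endpoint + "?" + urlencode({"query": query})


req = negotiate("http://data.ex.org/String_0002")
url = sparql_get_url("http://data.ex.org/sparql", bitext_query("en", "es", 0.8))
print(req.get_header("Accept"))
print(url)
```

The same resource URL could serve TMX or TBX to localisation tools and RDF or JSON-LD to linked-data clients, with the `Accept` header selecting the format; whether a single API should do so is one of the open questions on this slide.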
9. Liaisons and Consensus Building
[Figure: liaison map centred on Linked Data for Language Technology. Labels: ITS2.0; XLIFF; Content Processing (DOM); Linked Data Processing (RDF); MQM; ITS2.0 Ontology; PROV-O/Global Intelligent Content; Linguistic Linked Data; OCELOT. Use Cases: Content Analytics; Corpora curation; Content enrichment; Human-MT quality; Your use case.]
10. More Information
• Contact: dave.lewis@cs.tcd.ie
• https://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Automated_Translation_Services
• http://www.falcon-project.eu
• See also:
– Linked Data for Language Technology (LD4LT) W3C Community Group
– http://www.w3.org/community/ld4lt/