FALCON: Building the Localization Web
Andrzej Zydroń
XTM International
Dave Lewis
CNGL at Trinity College Dublin
• Open Data on the Web: W3C Semantic Web
standards allow data to be published on Web
– Fine-grained URI-based inter-linking
– Extensible meta-data
– Standard Query APIs
• Enables a Localization Web
– Terms and translations become linkable resources
– Meta-data from L10n workflows adds value
– Leverage in training Machine Translation and Text
Analytics
The Localization Web
The Localization Web = Decentralised Annotated
Global Translation Memory and Term Base
Linguistic Linked Data: Lexicons
Red
Phonetic form
Form
number
singular
[RED]
Form
plural
[REDES]
Phonetic form
number
Red
Sense
written form
“red”
Sense
written form
“malla”
equivalent
Red
image
Red
Sense Sense
translation
es - en
written form
“red” “network”
written form
Red
written form
Form
gender
femenine
“red”
Use Case: BabelNet.org
Words as Resources on the Web
Barak Obama if the
44th president of the
United State of
America. He was first
elected in 2009.
Barak Obama si el 44 º
presidente de los Estados
Unidos de América. Ha
fue electo primera vez en
2009.
http:// www.ex.org/obama_en.html
http:// www.ex.org/obama_es.html
The Web of Content The Localization Web
http://data.ex.org/String_0001
http:// data.ex.org/String_0002
Derived
From
Derived
From
Text: “Barak Obama if
the 44th president of the
United State of
America.”
Lang:en
Text:“Barak Obama si el
44 º presidente de los
Estados Unidos de
América.”
Lang:es
TranslatedBy:Google
Translate
Translated
From
Translation Data
Term: “United State of
America.”
Lang:en
Term:“Estados Unidos
de América.”
Lang:es
Translation
Of
http:// babelnet.org/345621
http:// babelnet.org/57835
Terminology Data
Topic: Barak Obama
Lang: en
BirthDate: 1961-08-04
Spouse: Michelle Obama
Residence: White House
http:// Dbpedia.org/Page/
Barak_Obama
Encyclopaedic Data
Langua
ge
Resour
ces
Langua
ge
Worker
s
Langua
ge
Techno
logy
Language Lifecycle Dependencies
Parallel
Text &
Term
base
Postedi
tors
Machin
e
Transla
tion
Active Curation: Dynamic MT Retraining
• Tighten curation cycle: from projects to
segments
– Prioritise postedits for retraining
• Prioritise Term
Identification by
posteditors
• Assemble MT-ready,
lexically-rich term bases
• TermWeb/XTM/DCU
• Introducing Next Gen Machine Translation
• Massive scale bilingual dictionaries
• BabelNet
• NER: forced decoding
• Dynamic retraining
• Optimal segment translation route
• L3Data curation, sharing
Next Generation Machine Translation
Data Management Lifecycles
Publish
Correct &
refine
Lex-
concept
lifecycleCorrect &
refine
Discover &
use
Discover &
use
Correct &
refine
Bitext lifecycle
Discover
data
(Re)train-
MT
Revise and
annotate
Publish
Content
lifecycle
Publish
I18n &
source QA
Trans
QA
Post-
edit
Automated
translation
Consume Create
• Better in-context postediting:
– XTM-Easyling
• Feeding term suggestions from posteditor to Terminology Management
– XTM-Interverbum
• Dynamic Retraining
– XTM-DCU
• Bilingual Dictionary SMT improvements
– XTM-DCU
• NER, terminology enforcements, forced decoding
– XTM-Interverbum-DCU
• Postediting prioritisation and term flagging
– TCD-DCU-XTM
• Publishing interlinks of parallel text, lexically rich term bases
– TCD: DG-T TM, EurVoc, Snomed-CT, LEMON, BabelNet
• Closing the loop – operational instrumentation of postediting
– Translog, iOmegaT
Systems Integration
More Information
• Contact: dave.lewis@cs.tcd.ie
• https://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Au
tomated_Translation_Services
• http://www.falcon-project.eu
• See also:
– Linked Data for Language Technology (LD4LT) W3C
Community Group
– http://www.w3.org/community/ld4lt/

Interverbum falcon-10oct14-az

  • 1.
    FALCON: Building theLocalization Web Andrzej Zydroń XTM International Dave Lewis CNGL at Trinity College Dublin
  • 2.
    • Open Dataon the Web: W3C Semantic Web standards allow data to be published on Web – Fine-grained URI-based inter-linking – Extensible meta-data – Standard Query APIs • Enables a Localization Web – Terms and translations become linkable resources – Meta-data from L10n workflows adds value – Leverage in training Machine Translation and Text Analytics The Localization Web The Localization Web = Decentralised Annotated Global Translation Memory and Term Base
  • 3.
    Linguistic Linked Data:Lexicons Red Phonetic form Form number singular [RED] Form plural [REDES] Phonetic form number Red Sense written form “red” Sense written form “malla” equivalent Red image Red Sense Sense translation es - en written form “red” “network” written form Red written form Form gender femenine “red”
  • 4.
  • 5.
    Words as Resourceson the Web Barak Obama if the 44th president of the United State of America. He was first elected in 2009. Barak Obama si el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2009. http:// www.ex.org/obama_en.html http:// www.ex.org/obama_es.html The Web of Content The Localization Web http://data.ex.org/String_0001 http:// data.ex.org/String_0002 Derived From Derived From Text: “Barak Obama if the 44th president of the United State of America.” Lang:en Text:“Barak Obama si el 44 º presidente de los Estados Unidos de América.” Lang:es TranslatedBy:Google Translate Translated From Translation Data Term: “United State of America.” Lang:en Term:“Estados Unidos de América.” Lang:es Translation Of http:// babelnet.org/345621 http:// babelnet.org/57835 Terminology Data Topic: Barak Obama Lang: en BirthDate: 1961-08-04 Spouse: Michelle Obama Residence: White House http:// Dbpedia.org/Page/ Barak_Obama Encyclopaedic Data
  • 6.
  • 7.
    Parallel Text & Term base Postedi tors Machin e Transla tion Active Curation:Dynamic MT Retraining • Tighten curation cycle: from projects to segments – Prioritise postedits for retraining • Prioritise Term Identification by posteditors • Assemble MT-ready, lexically-rich term bases
  • 8.
    • TermWeb/XTM/DCU • IntroducingNext Gen Machine Translation • Massive scale bilingual dictionaries • BabelNet • NER: forced decoding • Dynamic retraining • Optimal segment translation route • L3Data curation, sharing Next Generation Machine Translation
  • 9.
    Data Management Lifecycles Publish Correct& refine Lex- concept lifecycleCorrect & refine Discover & use Discover & use Correct & refine Bitext lifecycle Discover data (Re)train- MT Revise and annotate Publish Content lifecycle Publish I18n & source QA Trans QA Post- edit Automated translation Consume Create
  • 10.
    • Better in-contextpostediting: – XTM-Easyling • Feeding term suggestions from posteditor to Terminology Management – XTM-Interverbum • Dynamic Retraining – XTM-DCU • Bilingual Dictionary SMT improvements – XTM-DCU • NER, terminology enforcements, forced decoding – XTM-Interverbum-DCU • Postediting prioritisation and term flagging – TCD-DCU-XTM • Publishing interlinks of parallel text, lexically rich term bases – TCD: DG-T TM, EurVoc, Snomed-CT, LEMON, BabelNet • Closing the loop – operational instrumentation of postediting – Translog, iOmegaT Systems Integration
  • 11.
    More Information • Contact:dave.lewis@cs.tcd.ie • https://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Au tomated_Translation_Services • http://www.falcon-project.eu • See also: – Linked Data for Language Technology (LD4LT) W3C Community Group – http://www.w3.org/community/ld4lt/