This deck explains how validated links between segment translations and multilingual terminology from different sources can be captured as linguistic linked data within localisation workflows. It is based on work in the FALCON project (www.falcon-project.eu).
Active Curation of Bi-Text Resources in Commercial Localization Workflows

1. Active Curation of Bi-Text Resources in Commercial Localization Workflows
Dave Lewis (TCD), Andrzej Zydroń (XTM International)
2. The Localization Web
• Open Data on the Web: W3C Semantic Web standards allow data to be published on the Web
  – Fine-grained URI-based inter-linking
  – Extensible metadata
  – Standard query APIs
• Enables a Localization Web
  – Terms and translations become linkable resources
  – Metadata from L10n workflows adds value
  – Leverage in training Machine Translation and Automatic Term Extraction

The Localization Web = a decentralised, annotated, global Translation Memory and Term Base
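To make the idea of terms and translations as linkable resources concrete, here is a minimal sketch of a term pair published as JSON-LD. All URIs and the choice of SKOS/PROV properties are illustrative assumptions, not FALCON's actual schema.

```python
import json

# Hypothetical sketch: a term and its validated translation become
# URI-addressable, interlinked resources, with workflow metadata
# (who validated the link) attached via PROV. URIs are illustrative.
term_pair = {
    "@context": {
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "prov": "http://www.w3.org/ns/prov#",
    },
    "@id": "http://example.org/termbase/term/1234",
    "skos:prefLabel": {"@value": "production capacity", "@language": "en"},
    "skos:exactMatch": {
        "@id": "http://example.org/termbase/term/5678",
        "skos:prefLabel": {"@value": "capacité de production", "@language": "fr"},
    },
    # Workflow metadata: which post-editor validated this link.
    "prov:wasAttributedTo": "http://example.org/people/posteditor-42",
}

doc = json.dumps(term_pair, ensure_ascii=False, indent=2)
print(doc)
```

Because each term carries its own URI, any other resource on the Web can link to it at fine granularity, which is the property the slide's "Localization Web" relies on.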
8. • Public resources may not always yield the right definitions or translations for the context
• Need to track human validation/rejection to train automatic term extraction
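One simple way to track the validation/rejection signal described above is to log each human decision per term and derive an acceptance rate that can feed back into automatic term extraction. The function and term names below are illustrative, not part of any FALCON API.

```python
from collections import defaultdict

# Hypothetical sketch: log accept/reject decisions on suggested terms
# so acceptance rates can serve as a training signal for automatic
# term extraction.
decisions = defaultdict(lambda: {"accepted": 0, "rejected": 0})

def record(term, accepted):
    """Record one human validation (True) or rejection (False)."""
    key = "accepted" if accepted else "rejected"
    decisions[term][key] += 1

def acceptance_rate(term):
    """Fraction of decisions that accepted the term, or None if unseen."""
    d = decisions[term]
    total = d["accepted"] + d["rejected"]
    return d["accepted"] / total if total else None

record("chest freezer", True)
record("chest freezer", True)
record("réfrigérateur", False)

print(acceptance_rate("chest freezer"))   # 1.0
print(acceptance_rate("réfrigérateur"))   # 0.0
```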
9. Active Curation of Linked Language Resources

Worked example (reconstructed from the slide's figure). Source segment: "The company has also reduced its production capacity by ceasing manufacture of chest freezers and freestanding microwave ovens".
• Extraction & segmentation yields candidate terms: "production capacity", "chest freezer", "microwave oven".
• Annotation with existing terms: "production capacity" / "capacité de production" is already validated in the term base (✔).
• Auto-suggestion from Babelfy/BabelNet proposes "réfrigérateur" and "four à micro-onde" as unconfirmed candidates (?).
• Machine translation with term translations (MT vendor) produces: "D'autre part, la société a réduit sa capacité de production en arrêtant la production de réfrigérateur et de fours micro-onde pose-libre" — "réfrigérateur" ("refrigerator") is the wrong rendering of "chest freezer" (✗).
• Post-editing captures terms in context, correcting the output to "D'autre part, la société a réduit sa capacité de production en arrêtant la production de congélateurs coffres et de fours micro-ondes pose-libre" and validating the term pairs "chest freezer" / "congélateurs coffres" and "microwave oven" / "fours micro-ondes" (✔).
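The annotation-with-existing-terms step above can be sketched as a simple scan of the segment against a term base. The term base contents and the tuple layout are illustrative assumptions; a production system would use proper tokenisation and lemmatisation rather than substring matching.

```python
import re

# Minimal sketch of "annotation with existing terms": scan a source
# segment for term-base entries and attach their validated target-
# language translations. The term base below is illustrative.
TERM_BASE = {
    "production capacity": "capacité de production",
    "chest freezer": "congélateur coffre",
    "microwave oven": "four à micro-ondes",
}

def annotate(segment):
    """Return (term, translation, start, end) for each term found."""
    hits = []
    for term, translation in TERM_BASE.items():
        for m in re.finditer(re.escape(term), segment, re.IGNORECASE):
            hits.append((term, translation, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])

segment = ("The company has also reduced its production capacity by ceasing "
           "manufacture of chest freezers and freestanding microwave ovens")
for term, translation, start, end in annotate(segment):
    print(f"{term!r} -> {translation!r} at [{start}:{end}]")
```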
10. Linked Data Based on W3C Standards
• CSV on the Web: tables plus JSON metadata
• JSON-LD (JSON for Linked Data)
• Provenance vocabulary (PROV)
• Data Catalogue vocabulary (DCAT)
• Open Annotation
• ITS 2.0 vocabulary
• Also:
  – Provenance Plan
  – Open Data Rights Language
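A post-editor's validation of a term can be captured by combining two of the vocabularies listed: an Open Annotation (Web Annotation) record with PROV-style attribution. The resource URIs, the target fragment identifier, and the exact property choices are assumptions for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical sketch: a term validation expressed as a Web Annotation.
# "@context" and "motivation" come from the W3C Web Annotation model;
# the URIs and the textual body are illustrative.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "@id": "http://example.org/annotations/987",
    "type": "Annotation",
    "motivation": "assessing",
    "target": "http://example.org/bitext/segment/42#term=chest-freezer",
    "body": {"type": "TextualBody", "value": "validated"},
    "creator": "http://example.org/people/posteditor-42",
    "created": datetime(2014, 6, 1, tzinfo=timezone.utc).isoformat(),
}
print(json.dumps(annotation, indent=2))
```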
12. Active Curation: Dynamic MT Retraining
(Figure: a feedback loop between the parallel text & term base, machine translation, and post-editors.)
• Tighten the curation cycle: from projects to segments
  – Prioritise postedits for retraining
• Prioritise term identification by post-editors
• Assemble MT-ready, lexically rich term bases
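"Prioritise postedits for retraining" can be sketched as ranking post-edited segments by how much they diverge from the raw MT output, on the assumption that heavily edited segments carry the most new information. The character-level similarity score here is a simple illustrative choice, not FALCON's actual metric.

```python
from difflib import SequenceMatcher

# Sketch: rank (MT output, postedit) pairs for retraining priority.
# Scoring is an illustrative character-level edit ratio.
def edit_score(mt_output, postedit):
    """1.0 = completely rewritten, 0.0 = untouched."""
    return 1.0 - SequenceMatcher(None, mt_output, postedit).ratio()

segments = [
    ("la production de réfrigérateur", "la production de congélateurs coffres"),
    ("capacité de production", "capacité de production"),
]
ranked = sorted(segments, key=lambda s: edit_score(*s), reverse=True)
for mt, pe in ranked:
    print(f"{edit_score(mt, pe):.2f}  {pe}")
```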
13. Next Generation Machine Translation
• TermWeb/XTM/DCU
• Introducing next-generation machine translation
• Massive-scale bilingual dictionaries
• BabelNet
• Automatic term extraction: forced decoding
• Dynamic retraining
• Optimal segment translation route
• L3Data curation and sharing
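One common way to approximate the term enforcement that forced decoding provides is to protect term-base entries behind placeholders before translation and restore the required target terms afterwards. A real SMT decoder constrains its search directly; this placeholder sketch, with hypothetical names throughout, is only an illustration of the effect.

```python
# Illustrative placeholder-based term enforcement. TERMS and all
# function names are assumptions, not a real decoder API.
TERMS = {"chest freezer": "congélateur coffre"}

def protect(source):
    """Replace term-base entries with opaque tokens the MT engine keeps."""
    mapping = {}
    for i, (term, target) in enumerate(TERMS.items()):
        token = f"__TERM{i}__"
        if term in source:
            source = source.replace(term, token)
            mapping[token] = target
    return source, mapping

def restore(translation, mapping):
    """Swap each token in the MT output for the required target term."""
    for token, target in mapping.items():
        translation = translation.replace(token, target)
    return translation

src, mapping = protect("ceasing manufacture of chest freezer units")
# ... src would now be sent to the MT engine ...
print(restore("arrêt de la fabrication des unités __TERM0__", mapping))
```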
14. Data Management Lifecycles
(Figure: three interlocking lifecycles.)
• Content lifecycle: Create → I18n & source QA → Automated translation → Post-edit → Translation QA → Publish → Consume
• Bitext lifecycle: Discover data → (Re)train MT → Revise and annotate → Publish → Discover & use → Correct & refine
• Lex-concept lifecycle: Publish → Discover & use → Correct & refine
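The three lifecycles can be encoded as cyclic stage sequences, for example to drive or audit a workflow engine. The stage order below is reconstructed from the slide's figure and is an assumption.

```python
# Illustrative encoding of the three data management lifecycles as
# cyclic stage lists. Stage order is an assumption from the figure.
LIFECYCLES = {
    "content": ["create", "i18n & source QA", "automated translation",
                "post-edit", "translation QA", "publish", "consume"],
    "bitext": ["discover data", "(re)train MT", "revise and annotate",
               "publish", "discover & use", "correct & refine"],
    "lex-concept": ["publish", "discover & use", "correct & refine"],
}

def next_stage(lifecycle, stage):
    """Return the stage that follows `stage`, wrapping around the cycle."""
    stages = LIFECYCLES[lifecycle]
    i = stages.index(stage)
    return stages[(i + 1) % len(stages)]

print(next_stage("bitext", "publish"))            # discover & use
print(next_stage("lex-concept", "correct & refine"))  # publish
```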
15. Systems Integration
• Better in-context post-editing
  – XTM-Easyling
• Feeding term suggestions from post-editors to terminology management
  – XTM-Interverbum
• Dynamic retraining
  – XTM-DCU
• Bilingual dictionary SMT improvements
  – XTM-DCU
• NER, terminology enforcement, forced decoding
  – XTM-Interverbum-DCU
• Post-editing prioritisation and term flagging
  – TCD-DCU-XTM
• Publishing interlinks of parallel text and lexically rich term bases
  – TCD: DGT-TM, EuroVoc, SNOMED CT, lemon, BabelNet
• Closing the loop: operational instrumentation of post-editing
  – XTM