Improving the quality of a customized SMT system using shared training data
 

  • 2 things: (1) Show that pooling works, especially for data owners with less bitext than Microsoft. (2) Customization gives the data a boost.
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • System 1:
    - General test set BLEU = 0.1590
    - Tech* test set BLEU = 0.2890
    - TAUS Adobe test set BLEU = 0.1940
    System 3b:
    - General test set BLEU = 0.1353
    - Tech* test set BLEU = 0.3388
    - TAUS Adobe test set BLEU = 0.3374
    Training Data (lines/words):
    - Gendom: 1520177 lines / 22632777 enu words, 19988095 plk words
    - Tech: 1786035 lines / 22717903 enu words, 21205994 plk words
    - TAUS: 199210 lines / 2439361 enu words, 2306301 plk words
    - TAUS Adobe: 129084 lines / 1664918 enu words, 1512067 plk words
  • System 1:
    - General test set BLEU = 0.1799
    - Tech* test set BLEU = 0.3788
    - TAUS Dell test set BLEU = 0.2672
    System 2a:
    - General test set BLEU = 0.1476
    - Tech* test set BLEU = 0.3087
    - TAUS Dell test set BLEU = 0.3949
    System 2b:
    - General test set BLEU = 0.1728
    - Tech* test set BLEU = 0.4132
    - TAUS Dell test set BLEU = 0.3264
    System 3a:
    - General test set BLEU = 0.1733
    - Tech* test set BLEU = 0.4230
    - TAUS Dell test set BLEU = 0.3989
    System 3b:
    - General test set BLEU = 0.1485
    - Tech* test set BLEU = 0.3221
    - TAUS Dell test set BLEU = 0.4243
    Training Data:
    - Gendom: 4348176 lines
    - Tech: 3299908 lines
    - TAUS: 1612637 lines
    - TAUS Dell: 172017 lines
    * = new set of Tech test sentences deduped against entire training set
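
The BLEU figures in the two notes above can be tabulated to make the gains explicit. The following is a small sketch in Python. The scores are copied from the notes; the dictionary layout, the case labels "Adobe" and "Dell", and the helper `gain_in_points` are illustrative assumptions, not part of the original presentation.

```python
# BLEU scores copied from the notes above (scale 0 to 1).
# The nesting, the case labels and the helper name are illustrative only.
SCORES = {
    "Adobe": {
        "1":  {"General": 0.1590, "Tech": 0.2890, "TAUS Adobe": 0.1940},
        "3b": {"General": 0.1353, "Tech": 0.3388, "TAUS Adobe": 0.3374},
    },
    "Dell": {
        "1":  {"General": 0.1799, "Tech": 0.3788, "TAUS Dell": 0.2672},
        "2a": {"General": 0.1476, "Tech": 0.3087, "TAUS Dell": 0.3949},
        "2b": {"General": 0.1728, "Tech": 0.4132, "TAUS Dell": 0.3264},
        "3a": {"General": 0.1733, "Tech": 0.4230, "TAUS Dell": 0.3989},
        "3b": {"General": 0.1485, "Tech": 0.3221, "TAUS Dell": 0.4243},
    },
}


def gain_in_points(case, system, test_set, baseline="1"):
    """Return the BLEU gain of `system` over `baseline` in BLEU points (0-100 scale)."""
    by_system = SCORES[case]
    delta = by_system[system][test_set] - by_system[baseline][test_set]
    return round(100 * delta, 2)


if __name__ == "__main__":
    for case, by_system in SCORES.items():
        for system, by_test_set in by_system.items():
            if system == "1":
                continue  # System 1 is the baseline
            for test_set in by_test_set:
                points = gain_in_points(case, system, test_set)
                print(f"{case} system {system} on {test_set}: {points:+.2f} BLEU points vs system 1")
```

Read this way, system 3b gains on the data owner's own test set in both cases while losing on the General test set, which matches the trade-off described for lambda training in the slides.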

Presentation Transcript

  • Improving the quality of a customized SMT system using shared training data
    Chris.Wendt@microsoft.com
    Will.Lewis@microsoft.com
    August 28, 2009
  • Overview
    Engine and Customization Basics
    Experiment Objective
    Experiment Setup
    Experiment Results
    Validation
    Conclusions
  • Microsoft’s Statistical MT Engine
    Linguistically informed SMT
  • Microsoft Translator Runtime
  • Training
  • Microsoft’s Statistical MT Engine
  • Adding Domain Specificity
  • Experiment Objective
    Objective
    Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data.
  • Experiment Setup
    Data pool: TAUS Data Association’s repository of parallel translation data.
    Domain: computer-related technical documents.
    No distinction is made between software, hardware, documentation, and marketing material.
    Criteria for test case selection:
    • More than 100,000 segments of parallel training data
    • Fewer than 2M segments of parallel training data (beyond that point it would be valid to train a system using only the provider’s own data)
    Chosen case: Sybase
    Experiment series: measure BLEU scores on a reserved subset of the submitted data for systems trained with:
    1 General data, as used for www.microsofttranslator.com
    2a Only Microsoft’s internal parallel data, from localization of its own products
    2b Microsoft data + Sybase data
    3a General + Microsoft + TAUS
    3b General + Microsoft data + TAUS, with Sybase custom lambdas
    Measure BLEU on 3 sets of test documents, with 1 reference, reserved from the submission, not used in training:
    • Sybase
    • Microsoft
    • General
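
The selection criteria and the experiment series on this slide can be written down as a small configuration sketch. The thresholds and the data composition of each system come from the slide; the names `is_test_case_candidate`, `EXPERIMENT_SERIES` and `TEST_SETS` are hypothetical, and the segment counts used in the usage lines are the rounded Adobe, Dell and Microsoft figures from the validation slides later in the deck.

```python
# Thresholds from the slide: more than 100,000 and fewer than 2M segments.
MIN_SEGMENTS = 100_000
MAX_SEGMENTS = 2_000_000


def is_test_case_candidate(segments):
    """True if a data provider falls inside the window used to pick a test case."""
    return MIN_SEGMENTS < segments < MAX_SEGMENTS


# Systems compared in the experiment series (labels as used on the slide).
EXPERIMENT_SERIES = {
    "1":  {"data": ["General"], "custom_lambdas": False},
    "2a": {"data": ["Microsoft"], "custom_lambdas": False},
    "2b": {"data": ["Microsoft", "Sybase"], "custom_lambdas": False},
    "3a": {"data": ["General", "Microsoft", "TAUS"], "custom_lambdas": False},
    "3b": {"data": ["General", "Microsoft", "TAUS"], "custom_lambdas": True},
}

# Test sets reserved from the submissions and not used in training.
TEST_SETS = ["Sybase", "Microsoft", "General"]

if __name__ == "__main__":
    # Rounded sentence counts taken from the validation slides.
    for provider, segments in (("Adobe", 129_000), ("Dell", 172_000), ("Microsoft", 3_200_000)):
        print(provider, is_test_case_candidate(segments))
```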
  • System Details
  • Training data composition
    [Charts: German; Chinese (Simplified)]
    Sybase does not have enough data to build a system exclusively with Sybase data
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
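
For readers unfamiliar with the metric, here is a minimal sketch of corpus-level BLEU with a single reference per segment, as in the experiment setup. It assumes whitespace tokenization, n-grams up to length 4 and no smoothing; the presentation does not say which BLEU implementation or tokenization was used, so the function `corpus_bleu` below is illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0..1 scale, one reference per segment, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    print(corpus_bleu(["a b c d x"], ["a b c d e"]))
```

The scores on these slides are reported on the same 0 to 1 scale, so a gain of 8 BLEU points corresponds to a difference of 0.08.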
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    A gain of more than 8 BLEU points compared to the system built without the shared data
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    Best results are achieved using the maximum available data within the domain, using custom lambda training
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    Weight training (lambda training) without diversity in the training data has very little effect
    The diversity aspect was somewhat of a surprise for us. Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    Small data providers benefit more from sharing than large data providers, but all benefit
  • Experiment Results, measured in BLEU
    [Charts: Chinese; German]
    This is the best German Sybase system we could have built without TAUS
  • Validation: Adobe Polish
    Training data (sentences):
    • General 1.5M
    • Microsoft 1.7M
    • Adobe 129K
    • TAUS other 70K
    Even for a language without a lot of training data we can see nice gains by pooling.
  • Validation: Dell Japanese
    Training data (sentences)
    • General 4.3M
    • Microsoft 3.2M
    • TAUS 1.4M
    • Dell 172K
    Confirms the Sybase results
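
To put "small amounts" of provider data into perspective, the sketch below computes each data owner's share of the pooled training data in the two validation cases. The sentence counts are the rounded figures from the validation slides; the helper `share_of_pool` is hypothetical, and the calculation assumes the listed categories do not overlap (in particular, that "TAUS 1.4M" on the Dell slide excludes Dell's own 172K sentences).

```python
# Rounded sentence counts from the two validation slides.
ADOBE_POLISH = {"General": 1_500_000, "Microsoft": 1_700_000, "Adobe": 129_000, "TAUS other": 70_000}
DELL_JAPANESE = {"General": 4_300_000, "Microsoft": 3_200_000, "TAUS": 1_400_000, "Dell": 172_000}


def share_of_pool(counts, provider):
    """Fraction of all training sentences contributed by `provider` (categories assumed disjoint)."""
    return counts[provider] / sum(counts.values())


if __name__ == "__main__":
    print(f"Adobe share of Polish pool:  {share_of_pool(ADOBE_POLISH, 'Adobe'):.1%}")
    print(f"Dell share of Japanese pool: {share_of_pool(DELL_JAPANESE, 'Dell'):.1%}")
```

Under these assumptions the data owner contributes only a few percent of the training sentences in both cases.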
  • Example
    SRC The Monitor collects metrics and performance data from the databases and MobiLink servers running on other computers, while a separate computer accesses the Monitor via a web browser.
    1 Der Monitor sammelt Metriken und Leistungsdaten von Datenbanken und MobiLink-Servern, die auf anderen Computern ausführen, während auf ein separater Computer greift auf den Monitor über einen Web-Browser.
    2a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
    2b Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
    3a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.
    3b Der Monitor sammelt Kriterien und Performance-Daten aus der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer des Monitors über einen Webbrowser zugreift.
    REF Der Monitor sammelt Kriterien und Performance-Daten aus den Datenbanken und MobiLink-Servern die auf anderen Computern ausgeführt werden, während ein separater Computer auf den Monitor über einen Webbrowser zugreift.
    Google Der Monitor sammelt Metriken und Performance-Daten aus den Datenbanken und MobiLink-Server auf anderen Computern ausgeführt, während eine separate Computer auf dem Monitor über einen Web-Browser.
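
One simple way to compare the outputs in this example is to list which reference words each system fails to produce. The sketch below does this for the opening fragment of outputs 1, 3a and 3b. It assumes case-sensitive whitespace tokenization, and the helper `missing_reference_words` is hypothetical; it is not how the BLEU scores in the presentation were computed.

```python
def missing_reference_words(hypothesis, reference):
    """Reference words, in reference order, that do not occur in the hypothesis."""
    hypothesis_words = set(hypothesis.split())
    return [word for word in reference.split() if word not in hypothesis_words]


# Opening fragments copied from the example slide.
REFERENCE = "Der Monitor sammelt Kriterien und Performance-Daten aus den Datenbanken"
OUTPUTS = {
    "1":  "Der Monitor sammelt Metriken und Leistungsdaten von Datenbanken",
    "3a": "Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken",
    "3b": "Der Monitor sammelt Kriterien und Performance-Daten aus der Datenbanken",
}

if __name__ == "__main__":
    for system, output in OUTPUTS.items():
        print(system, missing_reference_words(output, REFERENCE))
```

On this fragment, system 3b is the only one that reproduces the reference's term "Kriterien" and the preposition "aus", which illustrates how the customized system moves toward the data owner's terminology.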
  • Observations
    • Combining in-domain training data gives a significant boost to MT quality. In our experiment, the gain was more than 8 BLEU points compared to the best system built without the shared data.
    • Weight training (Lambda training) without diversity in the training data has almost no effect
    • Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
    • A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
    • Best results are achieved using the maximum available data within the domain, using custom lambda training
    • Small data providers benefit more from sharing than large data providers, but all benefit
  • Results
    • There is a noticeable benefit in sharing parallel data among multiple data owners within the same domain, as is the intent of the TAUS Data Association.
    • An MT system trained with the combined data can deliver significantly improved translation quality, compared to a system trained only with the provider’s own data plus baseline training.
    • Customization via a separate target language model and lambda training works
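
The customization mechanism named in the last bullet can be illustrated with a toy example. In statistical MT, lambdas commonly denote the weights of a log-linear model, where each candidate translation is scored as the weighted sum of its feature values, and lambda training tunes those weights toward a target test set. The sketch below assumes that formulation. The feature names, feature values and weights are invented for illustration and are not taken from Microsoft's engine.

```python
def score(features, lambdas):
    """Log-linear model score: weighted sum of feature values (log-probabilities here)."""
    return sum(lambdas[name] * value for name, value in features.items())


def best_hypothesis(candidates, lambdas):
    """Return the candidate translation with the highest model score."""
    return max(candidates, key=lambda text: score(candidates[text], lambdas))


# Two competing translations with invented feature values.
# "customer_lm" stands for a separate target language model built from the data owner's text.
CANDIDATES = {
    "Metriken und Leistungsdaten": {
        "translation_model": -2.0, "general_lm": -3.0, "customer_lm": -9.0,
    },
    "Kriterien und Performance-Daten": {
        "translation_model": -2.5, "general_lm": -6.0, "customer_lm": -2.0,
    },
}

# Invented weight sets: one favoring the general LM, one tuned toward the data owner.
GENERAL_LAMBDAS = {"translation_model": 1.0, "general_lm": 1.0, "customer_lm": 0.1}
CUSTOM_LAMBDAS = {"translation_model": 1.0, "general_lm": 0.2, "customer_lm": 1.0}

if __name__ == "__main__":
    print("general lambdas:", best_hypothesis(CANDIDATES, GENERAL_LAMBDAS))
    print("custom lambdas: ", best_hypothesis(CANDIDATES, CUSTOM_LAMBDAS))
```

With the same candidates and features, changing only the weights changes which translation wins. This is consistent with the observation that lambda training helps the lambda target and hurts everyone else.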
  • References
    Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for Computational Linguistics, June 2005
    Microsoft Translator: www.microsofttranslator.com
    TAUS Data Association: www.tausdata.org