Improving the quality of a customized SMT system using shared training data<br />Chris.Wendt@microsoft.com<br />Will.Lewis...
Overview <br />Engine and Customization Basics<br />Experiment Objective<br />Experiment Setup<br />Experiment Results<br ...
Microsoft’s Statistical MT Engine<br />3<br />Linguistically informed SMT<br />
Microsoft Translator Runtime<br />4<br />
Training<br />5<br />
Microsoft’s Statistical MT Engine<br />6<br />
Adding Domain Specificity <br />7<br />
Experiment Objective<br />Objective<br />Determine the effect of pooling parallel data among multiple data providers withi...
Experiment Setup<br />Data pool: TAUS Data Association’s repository of parallel translation data.<br />Domain: computer-re...
Less than 2M segments of parallel training data (at that point it would be valid to train a System using only the provider...
Microsoft
General</li></ul>9<br />
System Details<br />10<br />
Training data composition<br />German<br />Chinese (Simplified)<br />Sybase does not have enough data to build a system ex...
Experiment Results, measured in BLEU<br />Chinese<br />German<br />12<br />
Experiment Results, measured in BLEU<br />Chinese<br />German<br />13<br />
Experiment Results, measured in BLEU<br />Chinese<br />German<br />14<br />More than 8 point gain compared to system built...
Experiment Results, measured in BLEU<br />Chinese<br />Best results are achieved using the maximum available data within t...
Experiment Results, measured in BLEU<br />Chinese<br />Weight training (lambda training) without diversity in the training...
Experiment Results, measured in BLEU<br />Chinese<br />Lambda training with in-domain diversity has a significant positive...
Experiment Results, measured in BLEU<br />Chinese<br />A system can be customized with small amounts of target language ma...
Experiment Results, measured in BLEU<br />Chinese<br />Small data providers benefit more from sharing than large data prov...
Experiment Results, measured in BLEU<br />Chinese<br />This is the best German Sybase system we could have built without T...
Validation: Adobe Polish<br />Training Data (sentences):<br />General	1.5M<br />Microsoft	1.7M<br />Adobe	129K<br />TAUS o...
Validation: Dell Japanese<br />Training data (sentences)<br /><ul><li>General 	4.3M
Microsoft 	3.2M
Upcoming SlideShare
Loading in …5
×

Improving the quality of a customized SMT system using shared training data

2,069 views
1,933 views

Published on

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
2,069
On SlideShare
0
From Embeds
0
Number of Embeds
241
Actions
Shares
0
Downloads
24
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • 2 things:Show that pooling works, especially for data owners with less bitext than MicrosoftCustomization gives the data a boost
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  • System 1:- General test set BLEU = 0.1590- Tech* test set BLEU = 0.2890- TAUS Adobe test set BLEU = 0.1940 System 3b:- General test set BLEU = 0.1353- Tech* test set BLEU = 0.3388- TAUS Adobe test set BLEU = 0.3374 Training Data (lines/words):- Gendom: 1520177 lines / 22632777 enu words 19988095 plk words- Tech: 1786035 lines / 22717903 enu words 21205994 plk words- TAUS: 199210 lines / 2439361 enu words 2306301 plk words- TAUS Adobe: 129084 lines / 1664918 enu words 1512067 plk words
  • System 1:- General test set BLEU = 0.1799- Tech* test set BLEU = 0.3788- TAUS Dell test set BLEU = 0.2672 System 2a:- General test set BLEU = 0.1476- Tech* test set BLEU = 0.3087- TAUS Dell test set BLEU = 0.3949 System 2b:- General test set BLEU = 0.1728- Tech* test set BLEU = 0.4132- TAUS Dell test set BLEU = 0.3264 System 3a:- General test set BLEU = 0.1733- Tech* test set BLEU = 0.4230- TAUS Dell test set BLEU = 0.3989 System 3b:- General test set BLEU = 0.1485- Tech* test set BLEU = 0.3221- TAUS Dell test set BLEU = 0.4243 Training Data:- Gendom: 4348176 lines- Tech: 3299908 lines- TAUS: 1612637 lines- TAUS Dell: 172017 lines * = new set of Tech test sentences deduped against entire training set
  • Improving the quality of a customized SMT system using shared training data

    1. 1. Improving the quality of a customized SMT system using shared training data<br />Chris.Wendt@microsoft.com<br />Will.Lewis@microsoft.com<br />August 28, 2009<br />1<br />
    2. 2. Overview <br />Engine and Customization Basics<br />Experiment Objective<br />Experiment Setup<br />Experiment Results<br />Validation<br />Conclusions<br />2<br />
    3. 3. Microsoft’s Statistical MT Engine<br />3<br />Linguistically informed SMT<br />
    4. 4. Microsoft Translator Runtime<br />4<br />
    5. 5. Training<br />5<br />
    6. 6. Microsoft’s Statistical MT Engine<br />6<br />
    7. 7. Adding Domain Specificity <br />7<br />
    8. 8. Experiment Objective<br />Objective<br />Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data.<br />8<br />
    9. 9. Experiment Setup<br />Data pool: TAUS Data Association’s repository of parallel translation data.<br />Domain: computer-related technical documents.<br /> No difference is made between software, hardware, documentation and marketing material.<br />Criteria for test case selection:<br /><ul><li>More than 100,000 segments of parallel training data
    10. 10. Less than 2M segments of parallel training data (at that point it would be valid to train a System using only the provider’s own data)</li></ul>Chosen case: Sybase<br />Experiment Series: Observe BLEU scores using a reserved subset of the submitted data against systems trained with<br />1 General data, as used for www.microsofttranslator.com<br />2a Only Microsoft’s internal parallel data, from localization of its own products<br />2b Microsoft data + Sybase data<br />3a General + Microsoft + TAUS<br />3b General + Microsoft data + TAUS, with Sybase custom lambdas<br />Measure BLEU on 3 sets of test documents, with 1 reference, reserved from the submission, not used in training:<br /><ul><li>Sybase
    11. 11. Microsoft
    12. 12. General</li></ul>9<br />
    13. 13. System Details<br />10<br />
    14. 14. Training data composition<br />German<br />Chinese (Simplified)<br />Sybase does not have enough data to build a system exclusively with Sybase data<br />11<br />
    15. 15. Experiment Results, measured in BLEU<br />Chinese<br />German<br />12<br />
    16. 16. Experiment Results, measured in BLEU<br />Chinese<br />German<br />13<br />
    17. 17. Experiment Results, measured in BLEU<br />Chinese<br />German<br />14<br />More than 8 point gain compared to system built without the shared data<br />
    18. 18. Experiment Results, measured in BLEU<br />Chinese<br />Best results are achieved using the maximum available data within the domain, using custom lambda training<br />German<br />15<br />
    19. 19. Experiment Results, measured in BLEU<br />Chinese<br />Weight training (lambda training) without diversity in the training data has very little effect<br />German<br />16<br />The diversity aspect was somewhat a surprise for us. Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.<br />
    20. 20. Experiment Results, measured in BLEU<br />Chinese<br />Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else<br />German<br />17<br />
    21. 21. Experiment Results, measured in BLEU<br />Chinese<br />A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available<br />German<br />18<br />
    22. 22. Experiment Results, measured in BLEU<br />Chinese<br />Small data providers benefit more from sharing than large data providers, but all benefit<br />German<br />19<br />
    23. 23. Experiment Results, measured in BLEU<br />Chinese<br />This is the best German Sybase system we could have built without TAUS<br />German<br />20<br />
    24. 24. Validation: Adobe Polish<br />Training Data (sentences):<br />General 1.5M<br />Microsoft 1.7M<br />Adobe 129K<br />TAUS other 70K<br />21<br />Even for a language without a lot of training data we can see nice gains by pooling.<br />
    25. 25. Validation: Dell Japanese<br />Training data (sentences)<br /><ul><li>General 4.3M
    26. 26. Microsoft 3.2M
    27. 27. TAUS 1.4M
    28. 28. Dell 172K</li></ul>22<br />Confirms the Sybase results<br />
    29. 29. Example<br />SRC The Monitor collects metrics and performance data from the databases and MobiLink servers running on other computers, while a separate computer accesses the Monitor via a web browser. <br />1 Der Monitor sammelt Metriken und Leistungsdaten von Datenbanken und MobiLink-Servern, die auf anderen Computern ausführen, während auf ein separater Computer greift auf den Monitor über einen Web-Browser. <br />2a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift.<br />2b Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift. <br />3a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift. <br />3b Der Monitor sammelt Kriterien und Performance-Daten aus der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer des Monitors über einen Webbrowser zugreift. <br />REF Der Monitor sammelt Kriterien und Performance-Daten aus den Datenbanken und MobiLink-Servern die auf anderen Computern ausgeführt werden, während ein separater Computer auf den Monitor über einen Webbrowser zugreift. <br />Google Der Monitor sammelt Metriken und Performance-Daten aus den Datenbanken und MobiLink-Server auf anderen Computern ausgeführt, während eine separate Computer auf dem Monitor über einen Web-Browser. <br />23<br />
    30. 30. Observations<br /><ul><li>Combining in-domain training data gives a significant boost to MT quality. In our experiment more than 8 BLEU points compared to the best System built without the shared data.
    31. 31. Weight training (Lambda training) without diversity in the training data has almost no effect
    32. 32. Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
    33. 33. A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
    34. 34. Best results are achieved using the maximum available data within the domain, using custom lambda training
    35. 35. Small data providers benefit more from sharing than large data providers, but all benefit</li></ul>24<br />
    36. 36. Results<br /><ul><li>There is noticeable benefit in sharing parallel data among multiple data owners within the same domain, as is the intent of the TAUS Data Association.
    37. 37. An MT system trained with the combined data can deliver significantly improved translation quality, compared to a system trained only with the provider’s own data plus baseline training.
    38. 38. Customization via a separate target language model and lambda training works</li></ul>25<br />
    39. 39. References<br />Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for Computational Linguistics, June 2005<br />Microsoft Translator: www.microsofttranslator.com<br />TAUS Data Association: www.tausdata.org<br />26<br />

    ×