The document discusses a presentation given at the TAUS User Conference 2010 by Chris Wendt of Microsoft Research on machine translation technologies. The presentation covered Microsoft's machine translation service, which translates 32 languages using statistical machine translation models trained on vast amounts of publicly and privately sourced parallel text data. It also discussed efforts to improve translation quality through continued expansion of training data and collaborative frameworks to incorporate human feedback.
TAUS USER CONFERENCE 2010, More data equals better machine translation – the Microsoft view
1. TAUS USER CONFERENCE 2010
LANGUAGE BUSINESS INNOVATION
4 – 6 OCTOBER / PORTLAND (OR), USA
TUESDAY 5 OCTOBER / 10.30
MORE DATA EQUALS BETTER MACHINE TRANSLATION – THE MICROSOFT VIEW
Chris Wendt, Microsoft Research
7. Microsoft’s MT Service at a Glance
Translates 32 languages, any to any.
http://translator.bing.com (text and URLs)
Office: selection or whole document, via research pane, ribbon and mini translator
IE8 accelerator: popup, whole page
Messenger translation bot
Unique side-by-side web page viewer
In-place translation widget with collaborative translations
Bing “Translate this page”
Bing “Translate This” instant answer
Free API and Collaborative Translations Framework
8. Free Web API
SOAP http://api.microsofttranslator.com
AJAX
http (REST)
Very simple methods
Detect()
Translate()
AddTranslation()
GetTranslations()
Example:
string input = "My input sentence.";
string output = s.Translate(_appId, input, "", "de");
Advanced methods
Array functionality for the above
Text to Speech
Sentence Breaking
More language related methods in the works
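As a rough illustration of the http (REST) flavor of Translate(), the sketch below only builds a request URL and sends nothing. The base URL comes from the slide. The `/V2/Http.svc/Translate` path and the `appId`, `text`, `from` and `to` parameter names are assumptions based on the historical V2 HTTP interface and should be checked against the service documentation. The helper name `build_translate_url` is hypothetical. The empty source language mirrors the "" argument in the slide's example.

```python
from urllib.parse import urlencode

# Assumption: host from the slide plus the historical V2 HTTP path.
BASE_URL = "http://api.microsofttranslator.com/V2/Http.svc/Translate"


def build_translate_url(app_id: str, text: str, source: str, target: str) -> str:
    """Build a GET URL for a Translate() call (hypothetical helper).

    An empty `source` mirrors the "" passed in the slide's example.
    """
    params = {"appId": app_id, "text": text, "from": source, "to": target}
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # YOUR_APP_ID is a placeholder, not a working credential.
    print(build_translate_url("YOUR_APP_ID", "My input sentence.", "", "de"))
```

Keeping URL construction separate from the network call makes the parameter handling easy to check without a live service.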
10. Statistical MT - The Simple View
[Flow diagram]
Data sources: Government data, Microsoft manuals, Dictionaries, Phrasebooks, Publisher data, Web data
Collect and store parallel and target language data (Cosmos Cluster)
Train statistical models (HPC/MPI Cluster)
Translation Engine (Distributed Runtime)
Translation APIs and UX
User Input (Text, web pages, Chat etc) goes in, Translated Output comes back
11. Microsoft’s Statistical MT Engine
[Architecture diagram]
Syntactically informed SMT
Languages with source parser (English, Spanish, Japanese, French, German, Italian):
Source language parser, then Syntactic tree based decoder
Other source languages:
Source language word breaker, then Surface string based decoder
Surrounding processing steps:
HTML handling
Sentence breaking
Rule-based post processing
Case restoration
Models:
Distance and word-based reordering
Contextual translation model
Syntactic reordering model
Target language model
Syntactic word insertion and deletion model
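The split between the two decoders can be sketched as a simple lookup. This is an illustration of the routing the slide describes, not the engine's actual code; the function name `choose_decoder` is hypothetical.

```python
# Source languages the slide lists as having a parser.
PARSED_SOURCE_LANGUAGES = {
    "English", "Spanish", "Japanese", "French", "German", "Italian",
}


def choose_decoder(source_language: str) -> str:
    """Return the decoder the slide associates with a source language."""
    if source_language in PARSED_SOURCE_LANGUAGES:
        # A source parser exists, so the syntactic path is available.
        return "syntactic tree based decoder"
    # Otherwise only word breaking is available on the source side.
    return "surface string based decoder"


if __name__ == "__main__":
    for language in ("German", "Czech"):
        print(language, "->", choose_decoder(language))
```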
13. Data Sources
Web data gathering
Web-scale algorithms to find parallel pages
Page and sentence alignment
Existing (mostly) parallel data
Microsoft manuals
Dictionaries, phrasebooks
Government Data
Data sharing associations
Linguistic Data Consortium, TAUS Data Association, ELRA, …
Licensed data
Microsoft Press, …
Comparable (non-parallel) data
Wikipedia
News articles
15. Human Evaluations
Absolute
3 to 5 independent human evaluators are asked to rank translation quality for 250 sentences on a scale of 1 to 4
Comparing to human translated sentence
No source language knowledge required
4  Ideal                Grammatically correct, all information included
3  Acceptable           Not perfect, but definitely comprehensible, and with accurate transfer of all important information
2  Possibly Acceptable  May be interpretable given context/time, some information transferred accurately
1  Unacceptable         Absolutely not comprehensible and/or little or no information transferred accurately
Also: Relative evals, against a competitor, or a previous version of ourselves
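The slide does not say how the individual ratings are combined. One plausible aggregation, shown here purely as an assumption, is to average the evaluators' scores per sentence and then average across sentences. The function names are hypothetical.

```python
from statistics import mean

# Scale labels as given on the slide.
SCALE = {4: "Ideal", 3: "Acceptable", 2: "Possibly Acceptable", 1: "Unacceptable"}


def sentence_score(ratings):
    """Average the 1-4 ratings that 3 to 5 evaluators gave one sentence."""
    if not 3 <= len(ratings) <= 5:
        raise ValueError("expected ratings from 3 to 5 evaluators")
    if any(r not in SCALE for r in ratings):
        raise ValueError("ratings must be integers from 1 to 4")
    return mean(ratings)


def system_score(all_ratings):
    """Average the per-sentence scores (assumed aggregation, not from the slide)."""
    return mean(sentence_score(r) for r in all_ratings)


if __name__ == "__main__":
    # Two sentences, three evaluators each; a real run would use 250 sentences.
    print(system_score([[4, 3, 3], [2, 2, 2]]))
```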
17. Quality improvements in 2009
[Charts: BLEU by Release (EX) and BLEU by Release (XE). X-axis: monthly from Apr-08 to Feb-10, with releases 5.4, 5.5, 5.6 and 6.0 marked. One line per language: ARA, BGR, CHS, CSY, DAN, DEU, ELL, ESN, FIN, FRA, HEB, ITA, JPN, KOR, NLD, PLK, PTB, RUS, SVE, THA]
18. Experiment Results, measured in BLEU
Chinese
                                                  Test Set
System  Size    System Description                General  Microsoft  Sybase
1       8.3M    General domain                    14.26    29.74      34.81
2a      2.6M    Microsoft                         12.32    34.65      29.95
2b      2.8M    Microsoft with Sybase             12.16    34.66      30.24
3a      11.5M   General and Microsoft and TAUS    15.38    35.80      44.49
3b      11.5M   System 3a with Sybase lambda      12.57    29.51      47.16

German
                                                  Test Set
System  Size    System Description                General  Microsoft  Sybase
1       4.4M    General Domain                    25.19    40.61      34.85
2a      7.6M    Microsoft                         21.95    52.39      41.55
2b      7.8M    Microsoft with Sybase             22.83    52.07      42.07
3a      11.1M   General and Microsoft and TAUS    23.86    52.72      48.83
3b      11.1M   System 3a with Sybase lambda      19.44    37.27      50.85
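To make the trade-off in the German table easier to read, the short sketch below computes how each test set moves between two systems. The numbers are copied from the table. The helper name `bleu_delta` is hypothetical.

```python
# BLEU scores from the German table, ordered (General, Microsoft, Sybase).
GERMAN = {
    "1": (25.19, 40.61, 34.85),
    "2a": (21.95, 52.39, 41.55),
    "2b": (22.83, 52.07, 42.07),
    "3a": (23.86, 52.72, 48.83),
    "3b": (19.44, 37.27, 50.85),
}
TEST_SETS = ("General", "Microsoft", "Sybase")


def bleu_delta(table, system, baseline):
    """Return the BLEU change per test set when moving from baseline to system."""
    return {
        name: round(new - old, 2)
        for name, new, old in zip(TEST_SETS, table[system], table[baseline])
    }


if __name__ == "__main__":
    # Effect of the Sybase lambda weights on top of system 3a.
    print(bleu_delta(GERMAN, "3b", "3a"))
```

For German, moving from 3a to 3b gains about 2 BLEU on the Sybase test set while losing about 4.4 on General and about 15.5 on Microsoft, which shows that weights tuned for one domain cost quality on the others.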
20. HAT: A Paradigm Shift
Computer Aided Translation is becoming Human Aided Translation
Machine Translation is
Good enough to get the meaning across
Not good enough to fully substitute human translation
Merge MT with Human Translation using massive amounts of parallel data, and the community of humans
22. Example: CSS Knowledge Base – Czech
Data from Kai Gehrlach, Martine Smets, and Chris Moore
23. Search Engine Optimization
Machine Translating the Czech Knowledge Base
October 2009
2.5% of content of the English KB is human translated to Czech, ranked by page view.
The top 2.5% cover an estimated 50% of the page views.
The remaining content is untranslated.
January 2010
2.5% of content of the English KB is human translated to Czech, ranked by page view.
The top 2.5% cover an estimated 50% of the page views.
The remaining content is machine translated, starting December 5 and completed over the next 10 days.
24. Referrals from the Czech Republic
[Chart: Referrals to the CSS KB site from the top 2 search engines in the Czech Republic (google.cz and seznam.cz), split into referrals to the Czech KB (cs, blue) and to the KB in all other languages (green). X-axis: Oct FY10, Nov FY10, Dec FY10, Jan FY10. Y-axis: 0 to 140,000]
25. Resolution Rate Across Languages
[Bar chart: Resolution rate HT versus Resolution rate MT per language, on a scale of 0% to 70%. Languages: Arabic, Chinese (People's Republic of China), Chinese (Taiwan), Czech, French, German, Italian, Japanese, Korean, Portuguese, Portuguese (Brazil), Russian, Spanish, Turkish]
Source: Martine Smets, Microsoft Customer Support
27. Adding Domain Specificity
[Diagram: Syntactic tree based decoder consulting a set of models]
Models: a custom contextual translation model, domain and generic target language models, and other models
Notes on the diagram:
This model includes parallel data for the domain as well as my company (the custom translation model)
The target language models have an effect only if there is matching data in the translation model
Weight distribution determined by Λ training
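The slide says only that the weight distribution across models is determined by Λ training. In statistical MT this is conventionally a log-linear combination, where each candidate translation is scored as a weighted sum of model log-scores. The sketch below assumes that formulation. The feature names and numbers are invented for illustration.

```python
def loglinear_score(features, weights):
    """Weighted sum of model log-scores: sum over i of lambda_i * h_i."""
    return sum(weights[name] * features[name] for name in weights)


def best_candidate(candidates, weights):
    """Pick the candidate translation with the highest weighted score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Illustrative candidates with made-up log-scores from three models.
CANDIDATES = [
    {"text": "A", "features": {"translation": -2.0, "domain_lm": -1.0, "generic_lm": -3.0}},
    {"text": "B", "features": {"translation": -1.5, "domain_lm": -4.0, "generic_lm": -1.0}},
]

# Two hypothetical weight settings (lambdas): one favors the domain language model.
DOMAIN_WEIGHTS = {"translation": 1.0, "domain_lm": 1.0, "generic_lm": 0.2}
GENERIC_WEIGHTS = {"translation": 1.0, "domain_lm": 0.2, "generic_lm": 1.0}

if __name__ == "__main__":
    print(best_candidate(CANDIDATES, DOMAIN_WEIGHTS)["text"])
    print(best_candidate(CANDIDATES, GENERIC_WEIGHTS)["text"])
```

Changing only the weights flips which candidate wins, which is the mechanism by which a domain-tuned Λ can favor in-domain phrasing without retraining the models themselves.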
28. Microsoft Translator Runtime
[Architecture diagram: a Load Balancer in front of several Distributors, each Distributor fanning out to rows of Leaf nodes, each row backed by a Model Server. Counts shown on the diagram: (85), (14), (4)]
Distributor:
Gets a chunk to translate
Breaks chunk into sentences
Finds an engine to translate the sentence
Reassembles result chunk
Leaf:
Consults models
Determines the best alternative
Returns result to Distributor
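The Distributor and Leaf responsibilities on this slide can be sketched as a toy, single-process model. The sketch uses round-robin selection and a naive regular-expression sentence splitter. Both are assumptions, and the real runtime's scheduling and sentence breaking are not described in the deck. The class and function names are hypothetical.

```python
import re
from itertools import cycle


def break_into_sentences(chunk):
    """Naive sentence breaker: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]


class Leaf:
    """Stand-in for a translation engine; it only tags the sentence."""

    def __init__(self, name):
        self.name = name

    def translate(self, sentence):
        # A real leaf would consult models and pick the best alternative.
        return f"[{self.name}] {sentence}"


class Distributor:
    """Gets a chunk, breaks it into sentences, dispatches, reassembles."""

    def __init__(self, leaves):
        self._leaves = cycle(leaves)  # assumed round-robin engine selection

    def translate_chunk(self, chunk):
        sentences = break_into_sentences(chunk)
        results = [next(self._leaves).translate(s) for s in sentences]
        return " ".join(results)


if __name__ == "__main__":
    distributor = Distributor([Leaf("leaf1"), Leaf("leaf2")])
    print(distributor.translate_chunk("Hello there. How are you?"))
```

Translating sentence by sentence is what lets one chunk be spread across many leaves and reassembled in order.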
29. Training
400-CPU CCS/HPC cluster
[Training pipeline diagram]
Inputs:
Parallel data
Target language monolingual data
Processing steps:
Source language parsing
Source/Target word breaking
Word alignment
Treelet + Syntactic structure extraction
Phrase table extraction
Treelet table extraction
Syntactic models training
Surface reordering training
Language model training
Discriminative training of model weights
Outputs:
Case restoration model
Target language model
Distance and word-based reordering
Contextual translation models
Syntactic reordering model
Syntactic word insertion and deletion model
Model weights
30. References
Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for Computational Linguistics, June 2005
Microsoft Translator: www.microsofttranslator.com
TAUS Data Association: www.tausdata.org