TAUS USER CONFERENCE 2010
LANGUAGE BUSINESS INNOVATION
4 – 6 OCTOBER / PORTLAND (OR), USA




TUESDAY 5 OCTOBER / 10.30

MORE DATA EQUALS BETTER MACHINE
TRANSLATION – THE MICROSOFT VIEW
Chris Wendt, Microsoft Research
Agenda

 Microsoft Translator At A Glance
 Some basic technical info
 Drivers of coverage and quality
Internet Explorer
Office 2010
Windows Live Messenger
Bing Instant Answer and Web Page Viewer
Microsoft’s MT Service at a Glance
 Translates 32 languages, any to any.
 http://translator.bing.com (text and URLs)
 Office: selection or whole document
    via research pane, ribbon and mini translator
 IE8 accelerator: popup, whole page
 Messenger translation bot
 Unique side-by-side web page viewer
 In-place translation widget with collaborative translations
 Bing “Translate this page”
 Bing “Translate This” instant answer
 Free API and Collaborative Translations Framework
Free Web API
 http://api.microsofttranslator.com
 SOAP
 AJAX
 http (REST)

 Very simple methods
    Detect()
    Translate()
    AddTranslation()
    GetTranslations()

    Example:
       string input = "My input sentence.";
       string output = s.Translate(_appId, input, "", "de");

 Advanced methods
    Array functionality for the above
    Text to Speech
    Sentence Breaking

 More language related methods in the works
Statistical MT - The Simple View


[Diagram: data flow from data sources to translated output]
 Data sources: government data, Microsoft manuals, dictionaries, phrasebooks, publisher data, web data
 Collect and store parallel and target language data (Cosmos cluster)
 Train statistical models (HPC/MPI cluster)
 Translation engines in a distributed runtime, behind the translation APIs and UX
 User input (text, web pages, chat etc.) goes in; translated output comes out
Microsoft’s Statistical MT Engine

Syntactically informed SMT

[Diagram: engine pipeline]
 Pre-processing: HTML handling, sentence breaking
 Languages with source parser (English, Spanish, Japanese, French, German, Italian):
    source language parser, then syntactic tree based decoder
 Other source languages:
    source language word breaker, then surface string based decoder
 Models: distance and word-based reordering, contextual translation model, syntactic reordering model, target language model, syntactic word insertion and deletion model
 Post-processing: rule-based post processing, case restoration
Data Sources
 Web data gathering
    Web-scale algorithms to find parallel pages
    Page and sentence alignment
 Existing (mostly) parallel data
    Microsoft manuals
    Dictionaries, phrasebooks
    Government Data
    Data sharing associations
       Linguistic Data Consortium, TAUS Data Association, ELRA, …

    Licensed data
      Microsoft Press, …

 Comparable (non-parallel) data
    Wikipedia
    News articles
[Chart: Parallel Sentences, by month, Apr-08 to Feb-10]
Human Evaluations
 Absolute
 3 to 5 independent human evaluators are asked to rank
  translation quality for 250 sentences on a scale of 1 to 4
   Comparing to human translated sentence
   No source language knowledge required

   4 Ideal               Grammatically correct, all information
                         included
   3 Acceptable          Not perfect, but definitely comprehensible,
                         and with accurate transfer of all important
                         information
   2 Possibly Acceptable May be interpretable given context/time, some
                         information transferred accurately
   1 Unacceptable        Absolutely not comprehensible and/or little or
                         no information transferred accurately

 Also: Relative evals, against a competitor, or a previous version of ourselves
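
As a rough illustration of how the absolute scale could be turned into one score per language pair, the sketch below averages each sentence over its evaluators and then averages over sentences. The slide does not say how the scores are aggregated, so the averaging scheme, the function name `system_score`, and the sample ratings are all assumptions.

```python
from statistics import mean

# Hypothetical ratings: one list per evaluator, one score (1 to 4) per sentence.
ratings = {
    "evaluator_a": [4, 3, 2, 3],
    "evaluator_b": [3, 3, 2, 4],
    "evaluator_c": [4, 2, 1, 3],
}


def system_score(ratings_by_evaluator):
    """Average each sentence over evaluators, then average over sentences.

    This aggregation is an assumption; the slide only defines the scale.
    """
    for scores in ratings_by_evaluator.values():
        if any(score < 1 or score > 4 for score in scores):
            raise ValueError("scores must be on the 1 to 4 scale")
    # zip(*...) groups the scores that different evaluators gave one sentence.
    per_sentence = [mean(group) for group in zip(*ratings_by_evaluator.values())]
    return mean(per_sentence)


if __name__ == "__main__":
    print(round(system_score(ratings), 2))
```

With 3 to 5 evaluators and 250 sentences, the same function would produce a two-decimal score of the kind reported per language pair.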
Human Evaluation Scores
     ARA-ENU   2.81   ENU-ARA   2.13
     BGR-ENU   2.70   ENU-BGR   2.17
     CHS-ENU   2.54   ENU-CHS   2.38
     CSY-ENU   2.27   ENU-CSY   2.01
     DAN-ENU   2.84   ENU-DAN   2.58
     DEU-ENU   3.17   ENU-DEU   2.63
     ELL-ENU   2.65   ENU-ELL   2.12
     ESN-ENU   2.80   ENU-ESN   2.69
     FIN-ENU   2.46   ENU-FIN   2.26
     FRA-ENU   2.67   ENU-FRA   2.44
     HEB-ENU   2.53   ENU-HEB   2.37
     ITA-ENU   2.83   ENU-ITA   2.35
     JPN-ENU   2.53   ENU-JPN   2.52
     KOR-ENU   2.35   ENU-KOR   2.59
     NLD-ENU   2.55   ENU-NLD   2.47
     PLK-ENU   2.67   ENU-PLK   1.66
     PTB-ENU   2.79   ENU-PTB   2.51
     RUS-ENU   2.72   ENU-RUS   2.20
     SVE-ENU   2.83   ENU-SVE   2.40
     THA-ENU   2.18   ENU-THA   2.24
     TRK-ENU   2.07   ENU-TRK   2.18
Quality improvements in 2009
[Charts: BLEU by Release (EX) and BLEU by Release (XE), by month from Apr-08 to Feb-10, releases 5.4, 5.5, 5.6 and 6.0]
Languages plotted: ARA, BGR, CHS, CSY, DAN, DEU, ELL, ESN, FIN, FRA, HEB, ITA, JPN, KOR, NLD, PLK, PTB, RUS, SVE, THA
Experiment Results, measured in BLEU
Chinese
                                                          Test Set
System Size    System Description               General   Microsoft   Sybase
1      8.3M    General domain                     14.26       29.74    34.81
2a     2.6M    Microsoft                          12.32       34.65    29.95
2b     2.8M    Microsoft with Sybase              12.16       34.66    30.24
3a     11.5M   General and Microsoft and TAUS     15.38       35.80    44.49
3b     11.5M   System 3a with Sybase lambda       12.57       29.51    47.16
German
                                                          Test Set
System Size    System Description               General   Microsoft   Sybase
1      4.4M    General Domain                     25.19       40.61    34.85
2a     7.6M    Microsoft                          21.95       52.39    41.55
2b     7.8M    Microsoft with Sybase              22.83       52.07    42.07
3a     11.1M   General and Microsoft and TAUS     23.86       52.72    48.83
3b     11.1M   System 3a with Sybase lambda       19.44       37.27    50.85
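
For readers unfamiliar with the metric in these tables, here is a minimal sketch of BLEU for a single sentence against a single reference. It is a simplification: scores like those above are normally computed over a whole test set, and the tokenization (whitespace split), the lack of smoothing, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU on a 0 to 100 scale, one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on a mat", "the cat sat on the mat"))
```

A higher score means more n-gram overlap with the human reference, which is why the in-domain systems score higher on their own test sets.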

HAT: A Paradigm Shift

              Computer Aided Translation
                          is becoming

               Human Aided Translation

 Machine Translation is
    Good enough to get the meaning across
    Not good enough to fully substitute human translation


   Merge MT with Human Translation using massive
   amounts of parallel data, and the community of
   humans
Collaborative Translation Framework
         Community enhanced MT

[Diagram: Machine Translation Models, Training, Worldwide Translation Memory]
Example: CSS Knowledge Base – Czech

Data from Kai Gehrlach, Martine Smets, and Chris Moore
Search Engine Optimization
Machine Translating the Czech Knowledge Base

October 2009
 2.5% of content of the English KB is human translated to Czech, ranked by page view.
 The top 2.5% cover an estimated 50% of the page views.
 The remaining content is untranslated.

January 2010
 2.5% of content of the English KB is human translated to Czech, ranked by page view.
 The top 2.5% cover an estimated 50% of the page views.
 The remaining content is machine translated, starting December 5 and completed over the next 10 days.
Referrals from the Czech Republic
 Referrals to the CSS KB site from the top 2 search engines in the Czech Republic (google.cz and seznam.cz)
    to the Czech KB (blue, "cs")
    to the KB in other languages (green, "All other languages")
[Chart: referrals per month, Oct FY10 to Jan FY10; vertical axis 0 to 140,000]
Resolution Rate Across Languages
[Chart: Resolution rate HT and Resolution rate MT, by language; horizontal axis 0% to 70%]
Languages: Arabic, Chinese (People's Republic of China), Chinese (Taiwan), Czech, French, German, Italian, Japanese, Korean, Portuguese, Portuguese (Brazil), Russian, Spanish, Turkish
Source: Martine Smets, Microsoft Customer Support
Adding Domain Specificity

[Diagram: syntactic tree based decoder drawing on a set of models]
 Models: contextual translation model, generic target language model, domain language model, custom model, other models
 Weight distribution determined by Λ training
 Callout (left): This model includes parallel data for the domain as well as my company
 Callout (right): The target language models have an effect only if there is matching data in the translation model
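
One common way to combine several models under trained weights is a log-linear score, where each model contributes a log-domain feature value multiplied by its weight. The slide does not spell out the formula, so the sketch below is an assumption about the general shape only; the feature names, weights, and numbers are hypothetical.

```python
def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values, one per model."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hypotheses, weights):
    """Pick the candidate translation with the highest combined score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))


# Hypothetical candidates with made-up log-probabilities from three models.
hypotheses = [
    {"text": "generic wording",
     "features": {"translation": -2.0, "generic_lm": -1.0, "domain_lm": -4.0}},
    {"text": "domain wording",
     "features": {"translation": -2.2, "generic_lm": -3.0, "domain_lm": -1.0}},
]

# Two hypothetical weight settings: one favors the generic language model,
# the other favors the domain language model.
generic_weights = {"translation": 1.0, "generic_lm": 1.0, "domain_lm": 0.1}
domain_weights = {"translation": 1.0, "generic_lm": 0.2, "domain_lm": 1.0}

if __name__ == "__main__":
    print(best_hypothesis(hypotheses, generic_weights)["text"])
    print(best_hypothesis(hypotheses, domain_weights)["text"])
```

Changing only the weights changes which candidate wins, which is the kind of effect a domain-specific weight setting would have on an otherwise unchanged system.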
Microsoft Translator Runtime
[Diagram: Load Balancer, Distributors (4), Leaf nodes (85), Model Servers (14)]
Steps shown, in order:
 Gets a chunk to translate
 Breaks chunk into sentences
 Finds an engine to translate the sentence
 Consults models
 Determines the best alternative
 Returns result to Distributor
 Reassembles result chunk
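
The distributor's role in the runtime (break a chunk into sentences, hand each sentence to a leaf, reassemble the results in order) can be sketched as follows. This is only an illustration of the pattern: the sentence breaker is a naive regular expression, `leaf_translate` is a stand-in that uppercases text instead of translating, and a thread pool stands in for the leaf machines.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def break_sentences(chunk):
    """Naive sentence breaker: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]


def leaf_translate(sentence):
    """Stand-in for a leaf; a real leaf would consult the model servers."""
    return sentence.upper()


def distribute(chunk, leaves=4):
    """Split a chunk, translate sentences in parallel, reassemble in order."""
    sentences = break_sentences(chunk)
    with ThreadPoolExecutor(max_workers=leaves) as pool:
        # Executor.map returns results in input order, so the reassembled
        # chunk keeps the original sentence order.
        translated = list(pool.map(leaf_translate, sentences))
    return " ".join(translated)


if __name__ == "__main__":
    print(distribute("Hello world. How are you? Fine!"))
```

Because sentences are translated independently, adding leaves increases throughput without changing the result.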
Training
[Diagram: training pipeline on a 400-CPU CCS/HPC cluster]
 Inputs: parallel data, target language monolingual data
 Processing of parallel data: source language parsing, source/target word breaking, word alignment, treelet + syntactic structure extraction
 Training steps: language model training, surface reordering training, phrase table extraction, treelet table extraction, syntactic models training, discriminative training of model weights
 Outputs: case restoration model, target language models, distance and word-based reordering, contextual translation models, syntactic reordering model, syntactic word insertion and deletion model, model weights
References
 Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation:
  Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for
  Computational Linguistics, June 2005
 Microsoft Translator: www.microsofttranslator.com
 TAUS Data Association: www.tausdata.org




                                                                                 30
TAUS USER CONFERENCE 2010, More data equals better machine translation – the Microsoft view

More Related Content

Viewers also liked

Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...
Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...
Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...TAUS - The Language Data Network
 
Common industry API for translation services presented by TAUS at FEISGILTT
Common industry API for translation services presented by TAUS at FEISGILTTCommon industry API for translation services presented by TAUS at FEISGILTT
Common industry API for translation services presented by TAUS at FEISGILTTTAUS - The Language Data Network
 
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann... Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...TAUS - The Language Data Network
 
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)TAUS - The Language Data Network
 
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...TAUS - The Language Data Network
 
A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...TAUS - The Language Data Network
 
The Theory and Practice of Computer Aided Translation Training System, Liu Q...
 The Theory and Practice of Computer Aided Translation Training System, Liu Q... The Theory and Practice of Computer Aided Translation Training System, Liu Q...
The Theory and Practice of Computer Aided Translation Training System, Liu Q...TAUS - The Language Data Network
 
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...TAUS - The Language Data Network
 
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...TAUS - The Language Data Network
 
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...TAUS - The Language Data Network
 

Viewers also liked (19)

Quality Management in Localization Certification
Quality Management in Localization CertificationQuality Management in Localization Certification
Quality Management in Localization Certification
 
Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...
Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...
Terminology Life Cycle Management Increasing Company-Wide Terminology Collabo...
 
Common industry API for translation services presented by TAUS at FEISGILTT
Common industry API for translation services presented by TAUS at FEISGILTTCommon industry API for translation services presented by TAUS at FEISGILTT
Common industry API for translation services presented by TAUS at FEISGILTT
 
Terminology in the cloud with memoQ and TaaS, CHAT2013
Terminology in the cloud with memoQ and TaaS, CHAT2013Terminology in the cloud with memoQ and TaaS, CHAT2013
Terminology in the cloud with memoQ and TaaS, CHAT2013
 
TAUS Best Practices Error Typology Guidelines
TAUS Best Practices Error Typology GuidelinesTAUS Best Practices Error Typology Guidelines
TAUS Best Practices Error Typology Guidelines
 
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann... Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 
TAUS Best Practices Adequacy/Fluency Guidelines
TAUS Best Practices Adequacy/Fluency GuidelinesTAUS Best Practices Adequacy/Fluency Guidelines
TAUS Best Practices Adequacy/Fluency Guidelines
 
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
 
TAUS USER CONFERENCE 2010, More data equals better machine translation – the Microsoft view

  • 1. TAUS USER CONFERENCE 2010 LANGUAGE BUSINESS INNOVATION 4 – 6 OCTOBER / PORTLAND (OR), USA TUESDAY 5 OCTOBER / 10.30 MORE DATA EQUALS BETTER MACHINE TRANSLATION – THE MICROSOFT VIEW Chris Wendt, Microsoft Research
  • 2. Agenda
      - Microsoft Translator At A Glance
      - Some basic technical info
      - Drivers of coverage and quality
  • 6. Bing Instant Answer and Web Page Viewer
  • 7. Microsoft’s MT Service at a Glance
      - Translates 32 languages, any to any.
      - http://translator.bing.com (text and URLs)
      - Office: selection or whole document, via research pane, ribbon and mini translator
      - IE8 accelerator: popup, whole page
      - Messenger translation bot
      - Unique side-by-side web page viewer
      - In-place translation widget with collaborative translations
      - Bing “Translate this page”
      - Bing “Translate This” instant answer
      - Free API and Collaborative Translations Framework
  • 8. Free Web API
      http://api.microsofttranslator.com
      - Interfaces: SOAP, AJAX, http (REST)
      - Very simple methods: Detect(), Translate(), AddTranslation(), GetTranslations()
      - Advanced methods: array functionality for the above, Text to Speech, Sentence Breaking
      - More language related methods in the works
      Example from the slide:
          string input = "My input sentence.";
          string output = s.Translate(_appId, input, "", "de");
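The Translate() call on slide 8 takes an application ID, the input text, a source language (left empty on the slide) and a target language. As a rough illustration of how the same four arguments might map onto the http (REST) interface, the sketch below only builds a request URL and sends nothing. The path `/V2/Http.svc/Translate` and the query parameter names are assumptions made for illustration; the slide confirms only the host name, and the service as presented in 2010 may no longer be available.

```python
from urllib.parse import urlencode

# Assumption: path and parameter names are illustrative, not taken from the slide.
BASE_URL = "http://api.microsofttranslator.com/V2/Http.svc/Translate"


def build_translate_url(app_id, text, source_lang, target_lang):
    """Build a GET URL carrying the same four arguments as Translate() on the slide."""
    params = {
        "appId": app_id,
        "text": text,
        "from": source_lang,  # empty string mirrors the "" passed on the slide
        "to": target_lang,
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "MY_APP_ID" is a placeholder, not a real credential.
    print(build_translate_url("MY_APP_ID", "My input sentence.", "", "de"))
```

Keeping URL construction separate from the network call makes the parameter mapping easy to inspect before any request is sent.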
  • 10. Statistical MT - The Simple View (diagram; labels only)
      - Data sources: Government data, Microsoft manuals, Dictionaries, Phrasebooks, Publisher data, Web data
      - Stages: Collect and store parallel and target language data; Train statistical models; Translation Engine
      - Runtime: User Input (text, web pages, chat etc.); Translation APIs and UX; Translated Output
      - Infrastructure labels: Distributed Runtime, Cosmos Cluster, HPC/MPI Cluster
  • 11. Microsoft’s Statistical MT Engine (diagram; labels only)
      - Languages with source parser (English, Spanish, Japanese, French, German, Italian): Syntactically informed SMT; Source language parser; Syntactic tree based decoder
      - Other source languages: Source language word breaker; Surface string based decoder
      - Processing steps: HTML handling; Sentence breaking; Rule-based post processing; Case restoration
      - Models: Distance and word-based reordering; Contextual translation model; Syntactic reordering model; Target language model; Syntactic word insertion and deletion model
  • 13. Data Sources
      - Web data gathering
          - Web-scale algorithms to find parallel pages
          - Page and sentence alignment
      - Existing (mostly) parallel data
          - Microsoft manuals
          - Dictionaries, phrasebooks
          - Government data
      - Data sharing associations: Linguistic Data Consortium, TAUS Data Association, ELRA, …
      - Licensed data: Microsoft Press, …
      - Comparable (non-parallel) data
          - Wikipedia
          - News articles
  • 14. Parallel Sentences (chart; monthly x-axis from Apr-08 to Feb-10)
  • 15. Human Evaluations
      - Absolute
          - 3 to 5 independent human evaluators are asked to rate translation quality for 250 sentences on a scale of 1 to 4
          - Compared against a human-translated sentence
          - No source language knowledge required
      - Scale
          4  Ideal                Grammatically correct, all information included
          3  Acceptable           Not perfect, but definitely comprehensible, and with accurate transfer of all important information
          2  Possibly Acceptable  May be interpretable given context/time, some information transferred accurately
          1  Unacceptable         Absolutely not comprehensible and/or little or no information transferred accurately
      - Also: relative evals, against a competitor, or a previous version of ourselves
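The absolute evaluation on slide 15 collects 3 to 5 ratings per sentence on the 1 to 4 scale. The slide does not say how those ratings are combined into the per-language scores reported on the next slide. The sketch below assumes the simplest option, averaging the evaluators per sentence and then averaging over sentences. The function name and data layout are hypothetical.

```python
from statistics import mean


def system_score(ratings_by_sentence):
    """Average 1-4 ratings: first across evaluators per sentence, then across sentences.

    ratings_by_sentence: list of lists, one inner list of evaluator ratings per sentence.
    Assumption: plain averaging; the slide does not specify the aggregation.
    """
    for ratings in ratings_by_sentence:
        if not ratings:
            raise ValueError("every sentence needs at least one rating")
        if any(r not in (1, 2, 3, 4) for r in ratings):
            raise ValueError("ratings must be integers from 1 to 4")
    per_sentence = [mean(ratings) for ratings in ratings_by_sentence]
    return mean(per_sentence)


if __name__ == "__main__":
    # Two sentences, three evaluators each (made-up ratings for illustration).
    print(round(system_score([[4, 3, 4], [2, 2, 3]]), 2))
```

Averaging per sentence first keeps a sentence with five ratings from outweighing one with three.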
  • 16. Human Evaluation Scores
      Pair      Score    Pair      Score
      ARA-ENU   2.81     ENU-ARA   2.13
      BGR-ENU   2.70     ENU-BGR   2.17
      CHS-ENU   2.54     ENU-CHS   2.38
      CSY-ENU   2.27     ENU-CSY   2.01
      DAN-ENU   2.84     ENU-DAN   2.58
      DEU-ENU   3.17     ENU-DEU   2.63
      ELL-ENU   2.65     ENU-ELL   2.12
      ESN-ENU   2.80     ENU-ESN   2.69
      FIN-ENU   2.46     ENU-FIN   2.26
      FRA-ENU   2.67     ENU-FRA   2.44
      HEB-ENU   2.53     ENU-HEB   2.37
      ITA-ENU   2.83     ENU-ITA   2.35
      JPN-ENU   2.53     ENU-JPN   2.52
      KOR-ENU   2.35     ENU-KOR   2.59
      NLD-ENU   2.55     ENU-NLD   2.47
      PLK-ENU   2.67     ENU-PLK   1.66
      PTB-ENU   2.79     ENU-PTB   2.51
      RUS-ENU   2.72     ENU-RUS   2.20
      SVE-ENU   2.83     ENU-SVE   2.40
      THA-ENU   2.18     ENU-THA   2.24
      TRK-ENU   2.07     ENU-TRK   2.18
  • 17. Quality improvements in 2009 (two charts: BLEU by Release (EX) and BLEU by Release (XE); releases 5.4, 5.5, 5.6, 6.0; series: ARA, BGR, CHS, CSY, DAN, DEU, ELL, ESN, FIN, FRA, HEB, ITA, JPN, KOR, NLD, PLK, PTB, RUS, SVE, THA)
  • 18. Experiment Results, measured in BLEU

      Chinese (BLEU by test set)
      System  Size    System Description               General  Microsoft  Sybase
      1       8.3M    General domain                   14.26    29.74      34.81
      2a      2.6M    Microsoft                        12.32    34.65      29.95
      2b      2.8M    Microsoft with Sybase            12.16    34.66      30.24
      3a      11.5M   General and Microsoft and TAUS   15.38    35.80      44.49
      3b      11.5M   System 3a with Sybase lambda     12.57    29.51      47.16

      German (BLEU by test set)
      System  Size    System Description               General  Microsoft  Sybase
      1       4.4M    General domain                   25.19    40.61      34.85
      2a      7.6M    Microsoft                        21.95    52.39      41.55
      2b      7.8M    Microsoft with Sybase            22.83    52.07      42.07
      3a      11.1M   General and Microsoft and TAUS   23.86    52.72      48.83
      3b      11.1M   System 3a with Sybase lambda     19.44    37.27      50.85
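The results on slide 18 are reported in BLEU, which compares the n-grams of system output against a reference translation. The slides do not say which BLEU implementation or tokenization was used. The sketch below is a minimal corpus-level BLEU, assuming a single reference per sentence, whitespace-tokenized input, n-grams up to length 4 and the standard brevity penalty. It illustrates the metric and is not the evaluation setup behind the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    hypotheses, references: parallel lists of token lists.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: each hypothesis n-gram counts at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat".split()]
    ref = ["the cat sat on a mat".split()]
    print(round(corpus_bleu(hyp, ref), 2))
```

Because there is no smoothing, any n-gram order with no matches gives a score of 0, so the sketch is only meaningful on test sets of reasonable size.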
  • 19.
  • 20. HAT: A Paradigm Shift
      Computer Aided Translation is becoming Human Aided Translation.
      Machine Translation is good enough to get the meaning across, but not good enough to fully substitute for human translation.
      Merge MT with human translation using massive amounts of parallel data and the community of humans.
  • 21. Collaborative Translation Framework Community enhanced MT Machine Translation Models Training Worldwide Translation Memory
  • 22. Example: CSS Knowledge Base – Czech Data from Kai Gehrlach, Martine Smets, and Chris Moore
  • 23. Search Engine Optimization: Machine Translating the Czech Knowledge Base
      October 2009
        2.5% of the content of the English KB is human translated to Czech, ranked by page view.
        The top 2.5% cover an estimated 50% of the page views.
        The remaining content is untranslated.
      January 2010
        2.5% of the content of the English KB is human translated to Czech, ranked by page view.
        The top 2.5% cover an estimated 50% of the page views.
        The remaining content is machine translated, starting December 5 and completed over the next 10 days.
  • 24. Referrals from the Czech Republic
      Referrals to the CSS KB site from the top 2 search engines in the Czech Republic (google.cz and seznam.cz):
        to the Czech KB (blue, labeled "cs")
        to the KB in other languages (green, labeled "All other languages")
      [Chart: y-axis from 0 to 140,000; x-axis Oct FY10, Nov FY10, Dec FY10, Jan FY10]
  • 25. Resolution Rate Across Languages
      [Chart: resolution rate HT versus resolution rate MT, axis from 0% to 70%, for Arabic, Chinese (People's Republic of China), Chinese (Taiwan), Czech, French, German, Italian, Japanese, Korean, Portuguese, Portuguese (Brazil), Russian, Spanish, Turkish]
      Source: Martine Smets, Microsoft Customer Support
  • 26.
  • 27. Adding Domain Specificity
      [Diagram: a syntactic tree based decoder combining Custom/Domain and Generic models (contextual translation model, target language model) with other models]
      Custom translation model: this model includes parallel data for the domain as well as my company.
      Target language models: the target language models have an effect only if there is matching data in the translation model.
      Weight distribution is determined by Λ training.
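The effect of the Λ weights can be sketched as a weighted sum of per-model log scores, a common way for statistical MT decoders to combine models. The slide only states that the weight distribution is set by Λ training, so the log-linear form, the model names (`domain_tm`, `generic_tm`, `target_lm`), and every number below are assumptions for illustration, not values from the talk.

```python
import math

def loglinear_score(features, lambdas):
    """Weighted sum of per-model log scores for one candidate translation."""
    return sum(lambdas[name] * value for name, value in features.items())

def best_candidate(candidates, lambdas):
    """Return the text of the candidate with the highest combined score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], lambdas))["text"]

# Hypothetical log-probabilities from a domain model, a generic model,
# and a target language model for two competing translations.
candidates = [
    {"text": "domain-style output",
     "features": {"domain_tm": math.log(0.6), "generic_tm": math.log(0.2),
                  "target_lm": math.log(0.3)}},
    {"text": "generic-style output",
     "features": {"domain_tm": math.log(0.1), "generic_tm": math.log(0.7),
                  "target_lm": math.log(0.3)}},
]

# Two hypothetical weight sets: one favoring the domain model, one the generic.
domain_lambdas = {"domain_tm": 1.0, "generic_tm": 0.2, "target_lm": 0.5}
generic_lambdas = {"domain_tm": 0.2, "generic_tm": 1.0, "target_lm": 0.5}

print(best_candidate(candidates, domain_lambdas))
print(best_candidate(candidates, generic_lambdas))
```

The same models with a different weight set pick a different winner. That is consistent with systems 3a and 3b on slide 18, which have the same size and differ in their lambda, yet score differently on each test set.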
  • 28. Microsoft Translator Runtime
      [Diagram: Load Balancer, Distributors, Leaf machines and Model Servers; the counts (85), (14) and (4) appear in the diagram]
      Distributor: gets a chunk to translate; breaks the chunk into sentences; finds an engine to translate the sentence; reassembles the result chunk.
      Leaf: consults models; determines the best alternative; returns the result to the Distributor.
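The distributor steps named on this slide (break a chunk into sentences, find an engine, reassemble the result) can be sketched as follows. This is a toy, single-process illustration under stated assumptions: the regex sentence breaker, the round-robin choice of leaf, and the names `break_into_sentences`, `distribute` and `make_leaf` are all hypothetical and are not the service's actual implementation.

```python
import re
from itertools import cycle

def break_into_sentences(chunk):
    """Naive sentence breaker: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]

def distribute(chunk, leaves):
    """Split a chunk, send each sentence to a leaf in turn, reassemble in order."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    sentences = break_into_sentences(chunk)
    leaf_picker = cycle(leaves)  # round-robin stands in for "finds an engine"
    translated = [next(leaf_picker)(sentence) for sentence in sentences]
    return " ".join(translated)

def make_leaf(name):
    """Hypothetical leaf: tags the sentence instead of translating it."""
    return lambda sentence: f"[{name}] {sentence}"

leaves = [make_leaf("leaf-1"), make_leaf("leaf-2")]
print(distribute("First sentence. Second one! Third?", leaves))
```

Working per sentence lets independent leaves handle parts of one request, while the distributor alone is responsible for restoring the original order.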
  • 29. Training
      [Diagram: training pipeline on a 400-CPU CCS/HPC cluster]
      Inputs: parallel data; target language monolingual data.
      Steps: source/target word breaking; source language parsing; word alignment; treelet + syntactic structure extraction; phrase table extraction; treelet table extraction; syntactic models training; surface reordering training; language model training; case restoration model; discriminative training of model weights.
      Resulting models: contextual translation model; distance and word-based reordering model; syntactic reordering model; syntactic word insertion and deletion model; target language model; case restoration model; model weights.
  • 30. References
      Chris Quirk, Arul Menezes, and Colin Cherry. Dependency Treelet Translation: Syntactically Informed Phrasal SMT. In Proceedings of ACL, Association for Computational Linguistics, June 2005.
      Microsoft Translator: www.microsofttranslator.com
      TAUS Data Association: www.tausdata.org