TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE


Moses on the Cloud for
Do-It-Yourself Machine
Translation

By Andrejs Vasiļjevs
Chairman of the Board, Tilde
     andrejs@tilde.com
• Language technology
  developer
• Localization service
  provider
• Leadership in smaller
  languages
• Offices in Riga (Latvia),
  Tallinn (Estonia) and Vilnius
  (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
Machine translation
Disruptive INNOVATION
Disruptive CHALLENGE
One size fits all?
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz
% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT
% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0
% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"
use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5
tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes
working-dir = /home/pkoehn/experiment

Just use Moses?
Build your own MT engine!

Customized MT
• Tilde (Coordinator), Latvia
• University of Edinburgh, UK
• Uppsala University, Sweden
• University of Copenhagen, Denmark
• University of Zagreb, Croatia
• Moravia, Czech Republic
• SemLab, Netherlands
• Online collaborative platform for
  MT building from user-provided
  data
• Repository of parallel and
  monolingual corpora for MT
  generation
• Automated training of SMT
  systems from specified
  collections of data
• Users can specify particular
  training data collections and
  build customised MT engines
  from these collections
• Users can also use the LetsMT!
  platform to tailor an MT system
  to their needs using their
  non-public data
• User-driven cloud-based MT
  factory, based on open-source
  MT tools
• Services for data collection, MT
  generation, customization and
  running of a variety of
  user-tailored MT systems
• Application in localization among
  the key usage scenarios
• Strong synergy with FP7 project
  ACCURAT to advance data-driven
  machine translation for under-
  resourced languages and
  domains
Resource Repository
• Stores SMT training data
• Supports different formats – TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
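
The conversion step listed above can be illustrated with a minimal sketch. It assumes a simplified TMX input and uses tab-separated sentence pairs as a stand-in for the unified format, which the slides do not specify. The function name `tmx_to_pairs` and the sample data are hypothetical and are not part of the LetsMT! platform.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) segment pairs from a TMX document string.

    Hypothetical helper: keeps only translation units that contain
    a non-empty segment in both requested languages.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Flatten inline markup and normalise whitespace.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        if segs.get(src_lang) and segs.get(tgt_lang):
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


# Simplified sample; a real TMX header carries more attributes.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save  changes</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
for src, tgt in pairs:
    print(f"{src}\t{tgt}")
```

The second translation unit has no Latvian side, so the sketch drops it; only complete pairs are useful as parallel training data.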

MT integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
[Diagram: platform workflow in three stages]
• Sharing of training data: Upload; SMT Resource Repository; SMT Resource Directory; Processing, Evaluation ...
• Training: Giza++; Moses SMT toolkit
• Using: SMT Multi-Model Repository (trained SMT models); SMT System Directory; Moses decoder
• Access: anonymous and authenticated; web page, web page translation widget, web browser plug-ins, web service, CAT tools
• System management, user authentication, access rights control ...
System Architecture
[Diagram: layered architecture]
• Clients: web browsers, CAT tools, browser plug-ins, widget, ... (connected via http/https, REST, SOAP, TCP/IP)
• Interface Layer (user interface): Web Page UI and Public API, i.e. webpage UI and web service API
• Application Logic Layer: Resource Repository Adapter, SMT training, Translation
• Data Storage Layer (Resource Repository): RR API, SVN; stores MT training data and trained models
• System DB
• High-performance Computing (HPC) Cluster: HPC frontend, SGE, File Share, CPU nodes; executes all computationally heavy tasks: SMT training, MT service, processing and aligning of training data, etc.
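
As a rough illustration of how a training task might be handed to the SGE-managed cluster, the sketch below wraps the `train-model.perl` call shown on the earlier slide in a `qsub` command line. This is an assumption for illustration only: the slides do not show how LetsMT! actually submits jobs, and the helper `build_qsub_command`, the job name, and the log directory are hypothetical.

```python
import shlex


def build_qsub_command(job_name, command, log_dir="logs"):
    """Wrap a command (list of arguments) in an SGE qsub invocation."""
    return [
        "qsub",
        "-N", job_name,                       # job name shown in qstat
        "-cwd",                               # run in the submission directory
        "-b", "y",                            # submit a binary, not a job script
        "-o", f"{log_dir}/{job_name}.out",    # stdout log
        "-e", f"{log_dir}/{job_name}.err",    # stderr log
        *command,
    ]


# Training call taken from the slide; the paths are the slide's examples.
train = [
    "train-model.perl",
    "--corpus", "factored-corpus/proj-syndicate",
    "--root-dir", "unfactored",
    "--f", "de", "--e", "en",
    "--lm", "0:3:factored-corpus/surface.lm:0",
]

cmd = build_qsub_command("smt-train-de-en", train)
print(shlex.join(cmd))  # printable form; nothing is submitted here
```

The sketch only builds and prints the command; passing `cmd` to `subprocess.run` on a host with SGE installed would perform the actual submission.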
Productivity increase
• Latvian: 32.9%*

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A. Evaluation of SMT in localization to under-resourced inflected language. In Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT 2011), pp. 35-40, May 30-31, 2011, Leuven, Belgium
Productivity increase
• Czech: 25.1%*
• Polish: 28.5%*

* LetsMT! Project Deliverable D6.4
New Moses features
• incremental training
• distributed language models
• interpolated language models for domain adaptation
• randomized language models to train using huge corpora
• translation of formatted texts
• running Moses decoder in a server mode
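
To make the interpolated language model item concrete, here is a toy sketch of linear interpolation, p(w) = λ·p_in(w) + (1 − λ)·p_gen(w), with λ chosen by a grid search that minimises perplexity on a small in-domain development text. The unigram probabilities, the development text, and the function names are invented for illustration. This shows the general technique only, not how Moses or LetsMT! implements it.

```python
import math


def interpolate(models, weights, word):
    """Linearly interpolated probability of a word under several models."""
    return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))


def perplexity(models, weights, text):
    """Perplexity of a token list under the interpolated model."""
    log_sum = sum(math.log(interpolate(models, weights, w)) for w in text)
    return math.exp(-log_sum / len(text))


# Toy unigram models over the same four-word vocabulary (made-up numbers).
in_domain = {"click": 0.4, "the": 0.3, "button": 0.2, "parliament": 0.1}
general = {"click": 0.05, "the": 0.5, "button": 0.05, "parliament": 0.4}
models = [in_domain, general]

# Small development text standing in for in-domain tuning data.
dev = ["click", "the", "button", "the", "parliament"]

# Grid search over the in-domain weight lambda in steps of 0.1.
candidates = [i / 10 for i in range(11)]
best_ppl, best_lambda = min(
    (perplexity(models, [lam, 1 - lam], dev), lam) for lam in candidates
)
print(f"best lambda = {best_lambda}, dev perplexity = {best_ppl:.3f}")
```

Because the weights sum to one, the interpolated model remains a proper probability distribution, and the tuned weight shifts toward whichever model fits the target domain better.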
tilde.com
Technologies for smaller languages


The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
