TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012

LETS MT!
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit.
MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
Latest news on Twitter - #MosesCore

Published in: Technology

Transcript

  • 1. TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE: Moses on the Cloud for Do-It-Yourself Machine Translation. By Andrejs Vasiļjevs
  • 2. Moses on the Cloud for Do-It-Yourself Machine Translation. Andrejs Vasiļjevs, Chairman of the Board, Tilde. andrejs@tilde.com
  • 3. • Language technology developer • Localization service provider • Leadership in smaller languages • Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania) • 135 employees • Strong R&D team • 9 PhDs and candidates
  • 4. machine translation
  • 5. disruptive INNOVATION
  • 6. CHALLENGE
  • 7. one size fits all?
  • 8. just use Moses?
      [ttable-file]
      0 0 5 /.../unfactored/model/phrase-table.0-0.gz
      % ls steps/1/LM_toy_tokenize.1* | cat
      steps/1/LM_toy_tokenize.1
      steps/1/LM_toy_tokenize.1.DONE
      steps/1/LM_toy_tokenize.1.INFO
      steps/1/LM_toy_tokenize.1.STDERR
      steps/1/LM_toy_tokenize.1.STDERR.digest
      steps/1/LM_toy_tokenize.1.STDOUT
      % train-model.perl --corpus factored-corpus/proj-syndicate --root-dir unfactored --f de --e en --lm 0:3:factored-corpus/surface.lm:0
      % moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"
      use-berkeley = true
      alignment-symmetrization-method = berkeley
      berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
      berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
      berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
      berkeley-java-options = "-server -mx30000m -ea"
      berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
      berkeley-process-options = "-EMWordAligner.numThreads 8"
      berkeley-posterior = 0.5
      tokenize
          in: raw-stem
          out: tokenized-stem
          default-name: corpus/tok
          pass-unless: input-tokenizer output-tokenizer
          template-if: input-tokenizer IN.$input-extension OUT.$input-extension
          template-if: output-tokenizer IN.$output-extension OUT.$output-extension
          parallelizable: yes
      working-dir = /home/pkoehn/experiment
  • 9. build your own MT engine!
  • 10. customized MT
  • 11. Tilde / Coordinator (LATVIA) • University of Edinburgh (UK) • Uppsala University (SWEDEN) • Copenhagen University (DENMARK) • University of Zagreb (CROATIA) • Moravia (CZECH REPUBLIC) • SemLab (NETHERLANDS)
  • 12. • Online collaborative platform for building MT from user-provided data • Repository of parallel and monolingual corpora for MT generation • Automated training of SMT systems from specified collections of data • Users can specify particular training data collections and build customised MT engines from these collections • Users can also use the LetsMT! platform to tailor MT systems to their needs using their non-public data
  • 13. • User-driven cloud-based MT factory, based on open-source MT tools • Services for data collection, MT generation, customization, and running of a variety of user-tailored MT systems • Application in localization among the key usage scenarios • Strong synergy with FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains
  • 14. Resource Repository: • Stores SMT training data • Supports different formats – TMX, XLIFF, PDF, DOC, plain text • Converts to unified format • Performs format conversions and alignment
  • 15. cMT
  • 16. Integration: • Integration with CAT tools • Integration in web pages • Integration in web browsers • API-level integration
  • 17. [Diagram: platform overview] Stages: Sharing of training data; Training; Using. Components: Upload; Processing, Evaluation ...; SMT Resource Repository; SMT Resource Directory; Giza++; Moses SMT toolkit; SMT Multi-Model Repository (trained SMT models); SMT System Directory; Moses decoder. Access channels: Web page; Web page translation widget; Web browser plug-ins; Web service; CAT tools. Access modes: Anonymous access; Authenticated access. Across the platform: System management, user authentication, access rights control ...
  • 18. [Diagram: System Architecture] Clients: Web Browser; CAT tools; Widget; Browser plug-ins (REST, SOAP, ... over http/https, TCP/IP). Interface Layer: Web Page UI; Public API (REST/SOAP); provides the user interface as webpage UI and web service API. Application Logic Layer: Resource Repository Adapter; SMT training; Translation. Data Storage Layer (Resource Repository): stores MT training data and trained models; RR API (REST); File Share; SVN; System DB. High-performance Computing (HPC) Cluster: HPC frontend; SGE; CPU nodes; executes all computationally heavy tasks: SMT training, MT service, processing and aligning of training data, etc.
  • 19. Latvian: 32.9%* productivity. * Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium
  • 20. Czech Polish % productivity 28.5% 25.1% * LetsMT! Project Deliverable D6.4
  • 21. New Moses features: • incremental training • distributed language models • interpolated language models for domain adaptation • randomized language models to train using huge corpora • translation of formatted texts • running Moses decoder in a server mode
  • 22. tilde.com: technologies for smaller languages. The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456
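
Slide 21 lists interpolated language models for domain adaptation among the new Moses features. The slides do not show how this is done, so the following is only a minimal sketch of the general idea of linear interpolation, not the Moses or LetsMT! implementation. The function names and the toy probabilities are hypothetical, and a simple grid search stands in for the weight estimation a real toolkit would perform.

```python
import math


def interpolate(p_in_domain, p_general, weight):
    """Linearly interpolate two LM probabilities for the same word and history.

    weight is the share given to the in-domain model; the rest goes to the
    general model, so the result is still a valid probability.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_in_domain + (1.0 - weight) * p_general


def perplexity(probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def best_weight(in_domain_probs, general_probs, steps=10):
    """Pick the weight (on a coarse grid) with the lowest held-out perplexity."""
    candidates = [i / steps for i in range(steps + 1)]

    def mixed_perplexity(w):
        mixed = [interpolate(a, b, w)
                 for a, b in zip(in_domain_probs, general_probs)]
        return perplexity(mixed)

    return min(candidates, key=mixed_perplexity)


# Hypothetical per-word probabilities on a small held-out in-domain text.
in_domain = [0.20, 0.01, 0.30]
general = [0.05, 0.10, 0.02]
weight = best_weight(in_domain, general)
```

The weight is tuned on held-out text from the target domain, so a small in-domain model can dominate where it is reliable while the general model covers words it has rarely seen.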