TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE


Moses on the Cloud for Do-It-Yourself Machine Translation

Andrejs Vasiļjevs
Chairman of the Board, Tilde
andrejs@tilde.com
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
machine translation

disruptive INNOVATION

disruptive CHALLENGE

one size fits all?
just use Moses?

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
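
To make the shape of one of these snippets concrete, here is a minimal Python sketch that groups the lines of a moses.ini-style file under their [section] headers. It is an illustration only: the function name parse_moses_ini is hypothetical, the sample is the [ttable-file] fragment shown on the slide (with its elided path left as-is), and the sketch does not interpret the individual fields.

```python
from collections import defaultdict


def parse_moses_ini(text):
    """Group the non-empty lines of a moses.ini-style file by [section] header."""
    sections = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current]  # register the section even if it stays empty
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


# The fragment from the slide; the "/.../" path is elided in the source.
SAMPLE = """\
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz
"""

config = parse_moses_ini(SAMPLE)
print(config["ttable-file"])
```

Even this toy reader hints at the point of the slide: every section has its own conventions, and a user has to know them before training a single engine.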
build your own MT engine!

customized MT
• Tilde (coordinator), Latvia
• University of Edinburgh, UK
• Uppsala University, Sweden
• Copenhagen University, Denmark
• University of Zagreb, Croatia
• Moravia, Czech Republic
• SemLab, Netherlands
• Online collaborative platform for building MT from user-provided data
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can select particular training data collections and build customised MT engines from them
• Users can also use the LetsMT! platform to tailor an MT system to their needs using their own non-public data
• User-driven, cloud-based MT factory built on open-source MT tools
• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems
• Application in localization is among the key usage scenarios
• Strong synergy with the FP7 project ACCURAT to advance data-driven machine translation for under-resourced languages and domains
Resource Repository
• Stores SMT training data
• Supports different formats: TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
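
As an illustration of what "converts to unified format" can involve, the following Python sketch pulls aligned segment pairs out of a small TMX document. The slides do not describe the repository's actual unified format, so the list of (source, target) tuples used here is an assumption, and the function name tmx_to_pairs and the sample sentences are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for translation units that have both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "lv")
print(pairs)
```

Translation units that lack one of the two languages are skipped, which is one simple way to keep the resulting parallel data aligned.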
Integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
[Figure: platform workflow in three stages]
• Sharing of training data: upload; processing, evaluation ...; SMT Resource Repository; SMT Resource Directory
• Training: Giza++ and the Moses SMT toolkit; SMT Multi-Model Repository (trained SMT models); SMT System Directory
• Using: Moses decoder, reached through web pages, a web page translation widget, web browser plug-ins, a web service and CAT tools, with anonymous or authenticated access
• Across all stages: system management, user authentication, access rights control ...
System Architecture

[Figure: layered architecture diagram]
• Clients: web browsers (html over http/https); CAT tools and browser plug-ins (REST, SOAP, ... over TCP/IP and https); widget (REST over https)
• Interface Layer: Web Page UI and Public API. User interface: web page UI, web service API
• Application Logic Layer: Resource Repository Adapter, SMT training, Translation; called from the Interface Layer via REST/SOAP over http
• Data Storage Layer (Resource Repository): RR API (REST), SVN. The Resource Repository stores MT training data and trained models
• System DB
• High-performance Computing (HPC) Cluster: HPC frontend (REST), SGE, file share, CPU nodes. The cluster executes all computationally heavy tasks: SMT training, MT service, processing and aligning of training data, etc.
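
The diagram names SGE as the scheduler on the HPC cluster but does not show how jobs are handed to it. As a rough sketch under that caveat, the Python below only assembles a qsub command line for a training job; it does not run anything. The job name, the train.sh wrapper script and the log directory are hypothetical, and the actual platform may submit jobs quite differently.

```python
import shlex


def build_qsub_command(job_name, script, script_args, log_dir="logs"):
    """Assemble (but do not run) an SGE qsub command for one batch job."""
    return [
        "qsub",
        "-N", job_name,                       # job name shown in the queue
        "-cwd",                               # run from the current working directory
        "-o", f"{log_dir}/{job_name}.out",    # file for standard output
        "-e", f"{log_dir}/{job_name}.err",    # file for standard error
        script,
        *script_args,
    ]


# Hypothetical wrapper script and arguments, for illustration only.
cmd = build_qsub_command("train-de-en", "train.sh", ["--f", "de", "--e", "en"])
print(shlex.join(cmd))
```

Keeping command construction separate from execution makes it easy for an application-logic layer to log or test what it would send to the cluster frontend.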
Latvian: 32.9%* productivity increase

* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A. Evaluation of SMT in localization to under-resourced inflected language. In Proceedings of the 15th International Conference of the European Association for Machine Translation (EAMT 2011), pp. 35-40, May 30-31, 2011, Leuven, Belgium.
Czech: 25.1%* productivity increase
Polish: 28.5%* productivity increase

* LetsMT! Project Deliverable D6.4
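
To make the percentages tangible, this short Python sketch applies them to a translator's throughput. It assumes the figures denote relative productivity increases and that Czech and Polish map to 25.1% and 28.5% as the slide layout suggests. The baseline of 500 words per hour is purely hypothetical and is not taken from the cited evaluations.

```python
def with_increase(baseline, percent_increase):
    """Throughput after applying a relative increase given in percent."""
    return baseline * (1 + percent_increase / 100)


# Percentages as reported on the slides.
REPORTED = {"Latvian": 32.9, "Czech": 25.1, "Polish": 28.5}

# Hypothetical baseline in words per hour, for illustration only.
BASELINE = 500

for language, percent in REPORTED.items():
    print(f"{language}: {with_increase(BASELINE, percent):.1f} words/hour")
```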
New Moses features
• incremental training
• distributed language models
• interpolated language models for domain adaptation
• randomized language models to train using huge corpora
• translation of formatted texts
• running Moses decoder in a server mode
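
For the last item, a client might talk to a Moses decoder running in server mode roughly as sketched below in Python. This assumes a mosesserver process exposing its XML-RPC interface with a translate method that takes and returns a struct with a "text" field, listening on port 8080. None of this is shown in the slides, the URL is a placeholder, and FakeMosesProxy is a hypothetical stand-in so the sketch runs without a server.

```python
import xmlrpc.client


def translate(proxy, text):
    """Send one sentence to a Moses server proxy and return the translated text."""
    response = proxy.translate({"text": text})
    return response["text"]


def connect(url="http://localhost:8080/RPC2"):
    """Create an XML-RPC proxy; no network traffic happens until a method is called."""
    return xmlrpc.client.ServerProxy(url)


class FakeMosesProxy:
    """Offline stand-in that upper-cases its input instead of translating it."""

    def translate(self, params):
        return {"text": params["text"].upper()}


# With a real server running, one would pass connect() instead of the fake proxy.
print(translate(FakeMosesProxy(), "das ist ein haus"))
```

Keeping the decoder resident in a server process avoids reloading large models for every request, which is what makes interactive use from web pages and CAT tools practical.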
tilde.com
technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Monaco, Andrejs Vasiljevs, Tilde, 25 March 2012
