WeMT Tools and Processes Welocalize TAUS Showcase October 2013 Localization World


WeMT Tools and Processes is a presentation given by Olga Beregovaya at Localization World 2013 in Silicon Valley, as part of the TAUS Showcase. It discusses automation and machine translation programs. Welocalize is the leader in localization and translation solutions.

Published in: Technology

  • Slide note: Our KPIs are organic: the list can grow or be adapted to a new situation, depending on particular needs.

    1. WeMT Tools and Processes. TAUS Showcase, October 2013. By Olga Beregovaya. copyright © welocalize 2013. all rights reserved. www.welocalize.com
    2. We’ll talk about:
         • MT Programs
         • Metrics
         • Engines
         • Language Tools
    3. Current MT Programs
         • Dell – 27 languages
         • Autodesk – 11 languages
         • PayPal – 8 languages
         • Cisco – 17 languages between 3 tiers
         • Intuit – 20+ languages
         • Microsoft (pre-project support)
         • McAfee (pilot)
         • … many more in pilot stage
    4. MT Program: Path-to-Success Components
         • A set of MT engines – “mix and match”
         • TMT Selection Mechanisms
         • Post-editing Environment
         • Processes and metrics
         • Data gathering and reporting tool – what, how much, how fast and at what effort
         • EDUCATION, EDUCATION, EDUCATION
         • CHANGE
       The recipe for success
    5. Process and Workflow: all aspects of the localization ecosystem are taken into consideration
         • MT KPIs: Selecting the right MT engine. By using our MT engine selection Scorecard we make sure all important KPIs are taken into consideration at selection time.
         • Empowerment through education. Internal, by the use of customized Toolkits; external, through specialised Trainings.
         • The feedback loop. Constructive communication from post-editor to MT provider.
       KPIs:
         • Productivity: Throughputs
         • Productivity: Delta
         • Quality: LQA
         • Quality: Automatic Scores
         • Cost
         • GlobalSight: Connectivity
         • GlobalSight: Tagging
         • Human Evaluation
         • Customization: Internal/External
         • Customization: Time
    6. MT Program Design - Source
         • Source content classification (i.e. marketing/UI/UA/UGC)
         • Length of the source segment
         • Source segment morpho-syntactic complexity
         • Presence/absence of pre-defined glossary terms or multi-word glossary elements, UI elements, numeric variables, product lists, ‘do-not-translate’ and transliteration lists
         • Tag density - Metadata attributes and their representation in localization industry standard formats (“tags”)
         • ROC – quality levels based on content use (“impact”)
       3D Model: Expected productivity mapped to desired quality levels and source content complexity
    7. MT Engine Selection Scorecard
         • Productivity - Throughputs: Number of post-edited words per hour
         • Productivity - Delta: Percentage difference between translation and post-editing time
         • Cost: Extrapolation, cost per word
         • CMS - Connectivity: Is there a connector in place?
         • Quality/Nature of source
         • Quality (Final) - LQA: Internal quality verification
         • Quality (MT) - Automatic Scores: A set of automatic scoring systems is used
       “We have tested and used different engines so we’ve seen the good, the bad and the ugly; now we can better appreciate what we have”
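The two productivity KPIs on the scorecard can be illustrated with a little arithmetic. The sketch below is only an assumption about how they might be computed: the slide does not say which time is used as the baseline for the delta, so translation-from-scratch time is assumed here, and the function names are hypothetical.

```python
def throughput_words_per_hour(words: int, seconds: float) -> float:
    """Post-edited words per hour for one timed job."""
    return words / (seconds / 3600.0)


def productivity_delta(translation_seconds: float, postediting_seconds: float) -> float:
    """Percentage difference between translation and post-editing time.

    Assumption: translation time is the baseline, so a positive value
    means post-editing was faster than translating from scratch.
    """
    return (translation_seconds - postediting_seconds) / translation_seconds * 100.0


if __name__ == "__main__":
    # 1,200 words post-edited in two hours
    print(throughput_words_per_hour(1200, 7200))
    # a task that took 100 s to translate and 75 s to post-edit
    print(productivity_delta(100, 75))
```

A real scorecard would aggregate these figures per engine, language pair and content type before comparing engines.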
    8. Scorecard - Metrics
         • Overall data
         • Productivity metrics
         • Automatic Scoring
         • Human Evaluation
    9. Toolkits and Trainings
       Our experience:
         • Most translators know and have experienced post-editing but they have limited knowledge of any other related aspect (automatic scoring, output differences between RBMT and SMT...)
         • The majority of people who work in localization have heard about MT but most of them still find it a daunting subject.
       Our answer:
         • Continuous MT and PE related trainings and documentation for language providers
         • Customized Toolkits for different internal departments (Production, Quality, Sales, Vendor Management)
    10. Transparency and Ownership
         • Theory – knowledge foundations
         • Practice – customized PE sessions for different client accounts
         • Transparency – process, engine selection/customization, evaluations
         • Responsibility – valid evaluations, constructive feedback, quality ownership
       “Training helps a lot - After I was told some of the background information and tips and tricks for certain engines/outputs, I was much more relaxed and happy to give MT a go.”
    11. Legacy data – best prediction tool: statistics from legacy knowledge base
    12. The feedback loop
         • “For me the biggest advantage would be the possibility to implement a client terminology list [in SMT]”
         • “I wish we could easily fix the corpus for outdated terminology and characters”
         • “Teach the engine to properly cope with sentences containing more than one verb and/or verbs in progressive form”
         • “engine retraining improved significantly the handling of tags and spaces around tags, this is a productive achievement as it saves us a lot of manual corrections.”
    13. Feedback and Engine Improvement
    14. “Beyond the Engine” Tools
         • Teaminology - crowdsourcing platform for centralized term governance; simultaneous concordance search of TMs and term bases => clean training data
         • Dispatcher - a global community content translation application that connects user-generated content (UGC), including live chats, social media, forums, comments and knowledge bases, to customized machine translation (MT) engines for real-time translation
         • Source Candidate Scorer – scoring of candidate sentences against historically good and bad sentences based on POS and perplexity
         • Corpus Preparation Toolkit – set of applications to maximize data preparation for MT engine training
    15. Teaminology
    16. Dispatcher
    17. Source Candidate Scorer: compares your source content to “the good” and “the bad” legacy segments and estimates potential suitability for MT
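The idea of scoring a candidate sentence against “good” and “bad” legacy segments can be sketched with perplexity alone. The example below is not the Welocalize tool: it assumes simple add-one-smoothed unigram language models over whitespace tokens, leaves out the POS features the slides mention, and all function names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count lowercased whitespace tokens; returns (counts, total tokens)."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, model, vocab_size):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.lower().split()
    if not tokens:
        return float("inf")
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def mt_suitability(sentence, good_segments, bad_segments):
    """Positive when the sentence looks more like the 'good' legacy data."""
    good_model = train_unigram(good_segments)
    bad_model = train_unigram(bad_segments)
    # shared vocabulary, plus one slot for unseen words
    vocab_size = len(set(good_model[0]) | set(bad_model[0])) + 1
    return (perplexity(sentence, bad_model, vocab_size)
            - perplexity(sentence, good_model, vocab_size))


if __name__ == "__main__":
    good = ["click the save button", "select the file menu"]
    bad = ["lol that thing totally broke again", "dunno why it crashed"]
    for candidate in ("click the file button", "lol it crashed again"):
        print(candidate, round(mt_suitability(candidate, good, bad), 2))
```

In practice the models would be trained on large legacy corpora with a higher-order language model, and the score would be only one input to the suitability estimate.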
    18. Corpus Preparation Suite: variety of tools to prepare corpus for training MT engines, such as:
         • Deleting formatting tags from TMX
         • Removing double spaces
         • Removing duplicated punctuation (e.g. commas)
         • Deleting segments where source = target
         • Deleting segments containing only URLs
         • Escaping characters
         • Removing duplicate sentences
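The cleaning steps listed on this slide can be sketched as one pass over source/target pairs. This is a minimal illustration, not the Welocalize suite: the regular expressions, the function names, and the choice of XML entity escaping for the “escaping characters” step are all assumptions.

```python
import re
from html import escape

TAG_RE = re.compile(r"<[^>]+>")              # inline formatting tags
MULTISPACE_RE = re.compile(r" {2,}")         # runs of two or more spaces
DUP_PUNCT_RE = re.compile(r"([,;:!?])\1+")   # e.g. ",," (periods kept, to preserve "...")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def clean_text(text):
    """Strip tags, collapse duplicated punctuation and double spaces."""
    text = TAG_RE.sub("", text)
    text = DUP_PUNCT_RE.sub(r"\1", text)
    return MULTISPACE_RE.sub(" ", text).strip()


def is_url_only(text):
    """True when every whitespace-separated token is a URL."""
    tokens = text.split()
    return bool(tokens) and all(URL_RE.fullmatch(t) for t in tokens)


def clean_corpus(pairs):
    """Apply the slide's cleaning steps to (source, target) pairs."""
    seen, cleaned = set(), []
    for source, target in pairs:
        source, target = clean_text(source), clean_text(target)
        if not source or not target:
            continue                          # nothing left after cleaning
        if source == target or is_url_only(source):
            continue                          # untranslated or URL-only segment
        if (source, target) in seen:
            continue                          # duplicate sentence pair
        seen.add((source, target))
        # assumption: "escaping characters" means XML special characters
        cleaned.append((escape(source, quote=False), escape(target, quote=False)))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("<b>Click  OK</b>,, now", "Cliquez  sur OK,, maintenant"),
        ("Dell", "Dell"),
        ("http://example.com", "http://example.com/fr"),
        ("Click OK, now", "Cliquez sur OK, maintenant"),
        ("A & B", "A et B"),
    ]
    for pair in clean_corpus(sample):
        print(pair)
```

The order of the steps matters: tags are stripped before escaping so that the markup is removed rather than escaped, and deduplication runs on the cleaned text so that pairs differing only in formatting collapse into one.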
    19. Corpus Preparation: TM Creator. Aggregates training data from various relevant sources
    20. Corpus Preparation: TMX Splitter. Extracts the relevant training corpus based on the TMX metadata
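Selecting training data by TMX metadata could look roughly like the sketch below, which keeps only translation units whose `<prop>` element matches a requested value. It is an illustration under assumptions, not the TMX Splitter itself: the property name `x-domain`, the function name and the sample file are hypothetical, and the sample header is abbreviated.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample; a real TMX header carries several required attributes
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><prop type="x-domain">UI</prop>
  <tuv xml:lang="en"><seg>Save</seg></tuv>
  <tuv xml:lang="de"><seg>Speichern</seg></tuv></tu>
<tu><prop type="x-domain">Marketing</prop>
  <tuv xml:lang="en"><seg>Work smarter</seg></tuv>
  <tuv xml:lang="de"><seg>Intelligenter arbeiten</seg></tuv></tu>
</body></tmx>"""


def select_units(tmx_text, prop_type, prop_value):
    """Return {language: segment text} for each <tu> whose <prop> matches."""
    root = ET.fromstring(tmx_text)
    selected = []
    for tu in root.iter("tu"):
        props = {p.get("type"): (p.text or "").strip() for p in tu.findall("prop")}
        if props.get(prop_type) != prop_value:
            continue
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() flattens any inline markup inside <seg>
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        selected.append(segments)
    return selected


if __name__ == "__main__":
    print(select_units(SAMPLE_TMX, "x-domain", "UI"))
```

The same filter could key on other metadata a TMX file carries, such as `<tu>` attributes, depending on how the legacy translation memories were tagged.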
    21. Welocalize Moses Implementation
       Why?
         • Far more control over engine quality since we can control corpus preparation and output post-processing
         • Control over metadata handling
         • Ties into our company open-source philosophy
         • Have experienced personnel in-house
         • Can extend and customize Moses functionality as necessary
         • Have connector to TMS (GlobalSight)
       RESULTS: In our internal tests with Moses/DoMT, we are getting automated scores similar to commercial engines for the languages into which we localize most. Same feedback received from human evaluators
    22. … And it works! We are in the position to offer realistic discounts and aggressive timelines providing quality levels appropriate for the content
    23. “Work-in-progress” Projects
         • Ongoing improvements to our adaptation of iOmegaT tool (Welocalize/CNGL)
         • Industry Partner in CNGL “Source Content Profiler” project
         • Adoption of TMTPrime (CNGL) - MT vs. Fuzzy Match selection mechanism
         • Language and content-specific pre-processing for the in-house Moses deployment
         • Teaminology – adding linguistic intelligence
    24. Contact: Language_Tools_Group_all@welocalize.com
       We speak MT - the language of the future
       Welocalize, Inc. www.welocalize.com
       Headquarters: 241 East 4th St. Suite 207, Frederick, Maryland 21701 USA
       [t] +1.301.668.0330
       [t] +1.800.370.9515 Toll Free
       [f] +1.301.668.0335
       [e] marketing@welocalize.com