TAUS USER CONFERENCE 2010, Man, Machine and advanced translation memory leveraging

Daniel Gervais, Executive Vice-president, MultiCorpora

Recent developments in TAUS Data Association super-cloud-based data sharing, coupled with advanced leveraging technologies, produce measurable increases in segment matching. However, there are heated debates about how translation pollution can arise in this context and about potential antidotes for such pollution. Daniel provides case studies to assess a central question that everyone is posing today: does increased matching through advanced leveraging technology equate to real productivity gain? Daniel's talk will offer innovative thinking on new collaboration models between linguists and TM systems.

Published in: Technology

  1. TAUS USER CONFERENCE 2010: LANGUAGE BUSINESS INNOVATION
     4 – 6 OCTOBER / PORTLAND (OR), USA
     MONDAY 4 OCTOBER / 15.00
     MAN, MACHINE AND ADVANCED TRANSLATION MEMORY LEVERAGING
     Daniel Gervais, MultiCorpora
  2. Five New Technologies... that will change enterprise computing.
     • Search – the Next Generation
     • Environments to create Virtual Companies
     • Virtualization Management Consoles
     • Secure Cloud Creation
     • Management Technologies
     Source: Eric Lundquist, Editor-in-Chief, eWeek smartertechnology.com
     © 2009 – 2010 | This confidential document is the property of MultiCorpora and cannot be shared, reproduced, distributed or used without permission.
  3. So, what does that mean for us?
     • Elastic capacity
     • Fault tolerant
     • Scalable
     • Secure
     • and easily maintained
     • Search – the Next Generation
     • Environments to create Virtual Companies
     • Virtualization Management Consoles
     • Secure Cloud Creation
     • Management Technologies
     Cool concepts, but...
     • How does this affect our industry?
     • How do we access them?
     • How do we harness them for greater productivity?
     • What are the real benefits?
     • What is the cost?
     • What are the best practices?
     • Where are they going?
  4. A brief roundup of SCbDS
     • Super-Cloud based Data Sharing
       o TDA
       o MyMemory
       o Google Translate
       o Grand Dictionnaire Terminologique, Termium, IATE, ...
       o EUR-Lex
       o Other multilingual public-domain sources
       o ...
  5. SCbDS upsides
     • Advances in technology support large translation memories
       o Build vs. existing
       o Proprietary vs. shared
       o Public domain mining
     • Align large multilingual corpora
     • Data mine within aligned corpora
     • Measurable benefits have been obtained through ALTM on top of large memories
     • BUT THERE'S A DANGER: Translation memory pollution & too much automation!
  6. Translation Memory Pollution is...
     • Correctly aligned segments containing poor translation:
       o Inadequate editing
       o Poor post-mortem cleanup
     • Incorrectly aligned segments:
       o Poor alignment technology
       o Inadequate post-alignment proofing
     • Rogue tags
     • Correct translation of undesired content
     • Correct translation of obsolete source
     • Obsolete translation of correct source
     • Poor translation of poorly written source content
  7. Translation Memory Pollution: overall conclusion
     Sentence-level leveraging in the absence of contextual information is too simplistic and can lead to unsatisfactory results!
  8. The Big Question
     • Does increased matching through ALTM equate to REAL productivity gain?
     • We say YES!
  9. Here's why we say YES!
     • Large enterprise case
     • Large government case
     • Department of Justice
     • Medax
     • UNESCO
     • Services Canada
  10. The main problem
     • Wide variation of document types
     • Legacy files in PDF
     • No TM for certain customers
     Secondary problems
     • Content is often complex
     • Highly sensitive to context and style
     • Highly client-specific
  11. Conventional TMs
     Mixed results:
     • No promised massive cost savings
     • Useful enforcement tool
     • Conventional terminology tool unwieldy
     • Excel spreadsheets preferred!
     Time investment critical
     • Therefore, selectivity of clients
     • No ability to influence clients at the authoring stage: documents are rarely repetitive on a traditional segment model
     • Cost-benefit decisions: no TMs or truncated TMs
  12. ALTM addressed needs for:
     • Context
     • Matches at the paragraph level
     • Matches at the segment and sub-segment levels
     • Interfacing/compatibility with external vendors who used various TM tools
     • Better integration with terminology management, live online deployment
     • Server-based solution to link global production platform
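Slide 12 lists matching at the segment and sub-segment levels. As a rough illustration only, and not MultiCorpora's actual algorithm, the sketch below finds word sequences shared between a new sentence and a stored source sentence using Python's standard `difflib`. The function name `shared_subsegments`, the `min_words` threshold, and the sample sentences are assumptions made for this example.

```python
from difflib import SequenceMatcher


def shared_subsegments(new_segment, tm_segment, min_words=3):
    """Return word sequences of at least `min_words` words that the new
    segment shares with a stored segment (hypothetical helper)."""
    a = new_segment.lower().split()
    b = tm_segment.lower().split()
    # autojunk=False keeps frequent words from being discarded as "junk".
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        " ".join(a[block.a:block.a + block.size])
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]


new = "The applicant must submit the form before the deadline"
stored = "Each applicant must submit the form within thirty days"
matches = shared_subsegments(new, stored)
print(matches)
```

Even though the two sentences would score as a weak match at the whole-segment level, the shared phrase can still be offered to the translator along with its stored context.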
  13. ALTM Benefits:
     • Alignment automation = Low overhead for maintaining memory
     • Rapid creation of larger memories = Faster project scoping and bidding
     • Higher probability of matches
     • Context provided at all times = Reduced research time
     • Identification of sub-expressions = More matches
     • Terminology integration = Reduced research time, increased consistency
     In general, more matches reduce revision time
     Used to rebuild out-of-date conventional TMs
     Cost-effective competitiveness
  14. Proof of proposal example
     • Translation Bureau RFP
       o For 1200 licenses
       o Proof of Proposal – 5 consecutive business days:
         - Install full client-server, 20 workstations
         - Create a production TM of 15 000 pairs of unstructured documents in various formats (≈ 20 M source words)
         - 1 day – user training for 10 people
         - 1 day – production simulation use
         - Ensure no productivity loss, compute gains
     MultiCorpora won the RFP
  15. Harmonize legacy documents
     • Department of Justice Canada
       o Laws & Regulations in French and English
       o No harmonization of ambiguous terms
       o ALTM made it possible to extract terminology, see the translation discrepancies in context and identify corrections
       o ALTM combined with terminology management made it possible to build TermBases of ambiguous terms by processing one document, then correct them in all other documents
       o Continuous learning process, powered by ALTM
     Do in computing minutes what used to take people months
  16. Medax
     • German translation service provider
     • Geographically dispersed translator pool: roughly 250 doctors and pharmacists
     • Seven full-time employees oversee processing of nearly 5 million words per year
     • Historically no clear TM strategy
     • Document types not conducive to TM
     • Lacklustre productivity gains vs. overhead
     • Discovered ROI from terminology management and sub-segment matching
     • High number of shorter, domain-specific repeated sub-segment phrases
     • Creates hybrid, partially pre-translated documents containing “pre-harmonized” terminology to send out
       o 90% comes from the TermBase, created by sub-segment matching and analysis
       o Remaining 10% from the TextBase
  17. UNESCO
     • “On the Fly” translation memories
       o Analyse docs against all translation memories
       o Identify which docs and memories are the most used
       o Re-build specific memories from UNESCO documents and from related organisations' documents referenced in them
       o Achieve a higher degree of recycling from partner organisations' documents
       o Ability to recycle / harmonize domain-specific terminology by example, powered by ALTM
       o Continuous improvement virtuous circle
     Create a TM in minutes vs. what would take months to align
     Add additional external content
     Get domain-specific terminology through sub-expressions
  18. Services Canada - Job Bank
     • Distinctive hybrid translation process
       o 90M words per year
       o TM / MT / post-editing
       o Linguistic assets comprise
         - Previous job offers
         - Domain-specific terms
     • Shared data increased productivity
  19. Translation Memory Pollution: Antidote
     • Content selection
       o Too much unstructured content
       o Need to establish a mining hierarchy
     • Use of statistics
       o Generate usage & translation distribution statistics per content repository
       o Standardize in “live” Terminology Databases
     • Use human intelligence
       o Humans need to be involved. Too much automation only propagates pollution…
       o Virtuous improvement circle
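The "translation distribution statistics" on slide 19 can be pictured as a count of how often each source term is rendered by each target term. The sketch below is a minimal illustration under stated assumptions: the term pairs are invented sample data, and the helper names (`translation_distribution`, `inconsistent_terms`) and the 90% `min_share` threshold are hypothetical, not taken from the talk.

```python
from collections import Counter, defaultdict


def translation_distribution(pairs):
    """Count how often each source term maps to each target term."""
    dist = defaultdict(Counter)
    for source, target in pairs:
        dist[source.lower()][target.lower()] += 1
    return dist


def inconsistent_terms(dist, min_share=0.9):
    """Flag source terms whose dominant translation covers less than
    `min_share` of occurrences (threshold is an assumption)."""
    flagged = {}
    for source, counts in dist.items():
        total = sum(counts.values())
        top_target, top_count = counts.most_common(1)[0]
        share = top_count / total
        if share < min_share:
            flagged[source] = (top_target, round(share, 2))
    return flagged


# Invented sample data: (source term, observed translation)
pairs = [
    ("agreement", "accord"),
    ("agreement", "accord"),
    ("agreement", "accord"),
    ("agreement", "entente"),
    ("act", "loi"),
    ("act", "loi"),
]
flagged = inconsistent_terms(translation_distribution(pairs))
print(flagged)
```

A flagged term is only a candidate for review: as the slide stresses, a human still decides which rendering to standardize in the terminology database.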
  20. Other uses of ALTM
     • Monolingual analysis
       o Identify single source candidates
       o Identify terms to standardize
       o Identify deviations of customized documents from baseline texts
       o Identify localization order prioritization of baseline documents: 15% savings potential
         - TextBase repetitions
         - Term repetitions
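The repetition analysis mentioned on slide 20 can be approximated, in a very simplified form, by counting identical segments in a monolingual corpus and estimating what share of the word volume they represent. This is only a sketch: the function name `repetition_report` and the sample segments are assumptions, and the resulting figure is unrelated to the 15% savings potential quoted on the slide.

```python
from collections import Counter


def repetition_report(segments):
    """Summarize exact segment repetitions in a monolingual corpus."""
    counts = Counter(s.strip().lower() for s in segments if s.strip())
    total_words = sum(len(s.split()) * n for s, n in counts.items())
    # Words in every occurrence after the first one count as repeated.
    repeated_words = sum(
        len(s.split()) * (n - 1) for s, n in counts.items() if n > 1
    )
    return {
        "repeated_segments": {s: n for s, n in counts.items() if n > 1},
        "repeated_word_share": repeated_words / total_words if total_words else 0.0,
    }


# Invented sample corpus
segments = [
    "Press the power button",
    "Press the power button",
    "Remove the cover",
    "Press the power button",
]
report = repetition_report(segments)
print(report)
```

Documents with a high repeated-word share would be natural candidates to localize first, since their translations can be reused across the rest of the set.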
  21. The Journey Is Not Yet Finished
     • More automation of the antidotes to pollution
     • Recent improvement in term extraction algorithms can expose pollution sources
     • Evangelization of the processes
     • No quick fix: the human factor remains involved. Not yet at the vision of fully automated pre-translated ALTM.
     • New collaboration models between linguists and TM systems
     • Better support for linguistic decision-making
     • Evangelization of the role of the post-editor
