Preparing your content for machine translation

Topics we will cover:
• What content is suitable for Machine Translation (MT)
• When quality really matters – and when it doesn’t
• What are the different approaches to MT
• What approach is best suited to what content type
• Why Rules-Based MT approaches care about grammar and why Statistical approaches don’t
• Working with your technical writers
• The 10 highest-impact authoring rules for each approach

We will also present the results of our study into the productivity gains that result from pre-editing content, and the ROI on this approach to improving MT efficiency.

Takeaways:
• How to select content for machine translation
• How to develop a content plan that includes MT in the mix
• How to decide which technology suits your content
• How to deal with vendors
• How to prepare content to improve machine translation effectiveness
• What guidelines to give your technical writers

  • Introductions: 30 minutes – Including Who are you? Why are you here? What do you hope to achieve?
  • Or talk about Content Rules just before you start your first section?
  • How many of you have more content to translate this year than a few years ago?
  • How many of you feel the pressure for speed?
  • Let me check to see if this is still true. How many of you have a bigger budget for localization next year?
  • What percentage of your company’s sales are international? Do you happen to know what percentage of your company’s budget is spent refining your English language content vs content for its international markets? (My customer in France, 60% of sales abroad, spent 90% of their time refining their French content)
  • Opening up your content to a more international audience. Translating new types of content that may have no budget. Speeding up the translation process. Improving consistency. Growing global sales. Supporting international customers better. Improving customer satisfaction.
  • What did we learn? We learned that machine translation is only as good as the humans who work on it
  • Note on Board
  • This leads me to an important point. It is really important that you understand the three different ways of handling global content and the impact of each.
  • Local expectations are important. The classic story here is Gerber Baby Food. In the US, we sell Gerber Baby Food in jars with pictures of a fat baby with puffy cheeks on the label. This meets our cultural expectations. In Africa, the cultural expectation – because literacy is low – is that the picture on the outside of the jar is of the ingredients. So a picture of a baby with fat cheeks says that the baby food consists of … you guessed it … ground up baby! A lesson Gerber learned very, very painfully.
  • Val, if you want to explain this (here are the full rules). Quick overview:
    ▫ Spell-check
    ▫ Make sure there are no grammatical errors
    ▫ Use complete, grammatically correct sentences
    ▫ Do not use more than 25 words per sentence
    ▫ Use short, simple, declarative sentences
    ▫ Use the active voice
    ▫ Keep phrases as close as possible to the word they modify
    ▫ Avoid embedded clauses introduced by commas or dashes
    ▫ Avoid parenthetical expressions in the middle of a sentence
    ▫ Avoid unusual punctuation
    ▫ Use a question mark only at the end of a direct question
    ▫ Use ‘could’ with ‘if’
    ▫ Repeat the head noun in ambiguous sentences (‘I clicked on the link and then I…’ not ‘I clicked on the link and…’)
    ▫ When writing in English, avoid the -ing gerund (‘how to print your document’ rather than ‘printing your document’)
    ▫ When appropriate, use an article or this/that before a noun
    ▫ Do not omit the relative pronouns ‘who’, ‘which’ or ‘that’
    ▫ Do not use pronouns which have no specific referent
    ▫ Do not make noun clusters of more than 3 nouns
    ▫ Avoid ambiguity (‘the back button and the forward button’ rather than ‘the back and forward buttons’)
    ▫ Avoid colloquialisms (example: ‘Two clicks and you’re good to go’)
    ▫ Avoid dangling modifiers or ending sentences with prepositions
    ▫ Use approved words from the glossary
    ▫ Use the same term consistently, without synonyms
    ▫ Avoid abbreviations that are not in the glossary
    ▫ Keep both parts of a two-part verb together
    ▫ Do not use slashes to list lexical items
    ▫ Use a hyphen to indicate the first part of a compound
    ▫ Tag texts such as addresses and proper names to prevent translation
    ▫ Make your instructions as specific as possible
    Write in a Simple, Clear Style: Avoid complexity, extra clauses, ambiguous phrases, sentence fragments and unnecessary words.
    Concise, Clear and to the Point: Do not leave out necessary grammatical elements and do use complete short sentences.
    Do Not Leave Out Necessary Words: The English language allows us to clearly convey our intentions even when we omit certain words, such as relative pronouns (who, whom, that, which), prepositions, and parts of verbs. In other languages these words are required and must be included in documents that will be translated.
    Avoid Using Slang or Idioms: Avoid idioms or slang in documents you plan to translate because these can differ from country to country. Terms commonly used in the US will not accurately translate for use in other countries. Example of an idiom: take the wind out of one's sails.
    Use Proper Punctuation: Make sure you use correct punctuation. Punctuation is a guide for computer translation software. Without correct punctuation, sentences can be interpreted in several ways.
    Accurate Spelling Is Very Important: Check your spelling and use your spell-checker before sending the text to be translated. If you give the MT engine incorrect information, your translation will be misinterpreted. Using a grammar checker will identify some of the most ambiguous phrases.
    Use Articles Whenever Possible: An article is a word used to indicate a noun and to state its purpose. For example, in English, the definite article is 'the' and the indefinite articles are 'a' and 'an'. Use of articles reduces ambiguities and gives you better translation results.
    Use Terminology and Abbreviations Consistently: Always use the same word, phrase or abbreviation to describe the same element or action each time that it appears in a document. Inconsistent wording can cause confusion for both humans and computers.
  • Can you read this aloud? Do you notice the second “the”? You make an automatic correction, but your MT engine won’t
  • Or this? You might not realize it, but your brain is a code-cracking machine. MT cannot fill in the blanks like your mind can, however.
  • Improving MT quality at source will also make your post-editors very happy.
  • Overview bullets of case study: Introduction, Reasons for study
  • Hypothesis
  • Scenario
  • Results
  • Scenario
  • Analysis
  • Analysis
  • Scenario
  • Scenario
  • Additional things to impact results even more
  • Greg: Given these characteristics of support content, MT is proving to be a huge asset. How do you deal with the dynamics of support content and the characteristics of support content? MT is a huge asset in a company’s support strategy. MT is just about as good as MT
  • And this is the part that is so revolutionary! If you look at “quality”, MT won’t stack up, but if you look at the only quality that’s important to the customer at this point – having the information solve their problem – then MT is nearly as successful: around a 5% difference for a fraction of the cost! Knowledge Base Resolve Rate*: users are more willing to accept language imperfections in exchange for useful information. Even when it’s imperfect, customers prefer it to no MT. Microsoft found that with Japanese MVP (Most Valuable Professional) customers, MT helped only 83% of the time, but 92% wanted more MT content.
  • What about buy in from all the stakeholders?

Presentation Transcript

  • Preparing Your Content for Machine Translation Val Swisher Founder & CEO Content Rules, Inc. @ContentRulesInc LinkedIn/in/ValSwisher Lori Thicke Founder & President LexWorks @LoriThicke LinkedIn/in/LoriThicke © 2014. Content Rules, Inc., LexWorks All rights reserved.
  • Agenda for today’s workshop
    ▫ Introductions
    ▫ Why Should MT Be Part of Your Content Strategy?
    ▫ A Brief History of MT & Machine Translation Approaches
    ▫ Preparing your Content for MT: Global Readiness
    ▫ Pre-Editing Case Study
    ▫ What MT Approach Best Suits My Use Case?
    ▫ Raw or Post-Edited: How Do You Like Your MT?
    ▫ Implementing MT: Executive Buy In; Insource or Outsource?
  • Introductions
  • Introductions: Lori Thicke
    ▫ MFA from the University of British Columbia
    ▫ Moved to Paris to write the Great Canadian novel
    ▫ Instead founded Lexcelera (France) & LexWorks (Canada)
    ▫ Along the way, founded Translators without Borders
    ▫ Specialty areas: languages; editing; technology-agnostic machine translation
    ▫ Passion: access to information for all, whatever their language
    ▫ Writing first book about language and development issues
  • About LexWorks
    ▫ Parent company, Lexcelera, was founded in Paris in 1986
    ▫ First in France to be ISO 9001:2000 certified for quality
    ▫ Deploying MT since 2007
    ▫ LexWorks established in Canada in 2010
    ▫ “T-Shaped”: technical + linguistic
    ▫ Engine agnostic
    ▫ Customize and maintain all types of MT engines and support users
  • Introductions: Val Swisher
    ▫ Founder and CEO of Content Rules
    ▫ Board member of Translators without Borders
    ▫ Specialty areas: global content strategy; terminology management; content quality / global readiness; single-sourcing / XML / DITA
    ▫ Finishing third book, “Global Content Strategy,” due out in May 2014
  • About Content Rules
    ▫ End-to-end content lifecycle firm
    ▫ Founded in 1994
    ▫ Specialize in: Global Content Strategy; Structured Content Strategy; Content Creation; Content Quality Assurance; Global Readiness
    ▫ Authorized Service Provider for Acrolinx
    ▫ Exclusive Licensee of The Rockley Method™
  • How about you?
    ▫ Name
    ▫ Company
    ▫ Title
    ▫ Responsibilities
    ▫ Using Machine Translation?
    ▫ Hoping to Learn Today
  • Why MT should be part of your content strategy: Content is exploding!
    ▫ A week’s worth of The New York Times contains more information than an 18th century person encountered in a lifetime
    ▫ More unique information will be generated this year than in the previous 5000 years
    ▫ More internet messages are sent in one day than there are people in the world
    ▫ Symantec created more content last year than in the previous decade
    ▫ Do you have more content than a few years ago?
  • And that’s not to mention User-Generated Content (UGC)
    ▫ UGC is the fastest-growing segment of the internet
    ▫ Hundreds of millions of blogs
    ▫ Nearly 40% of US companies use blogs for marketing purposes
    ▫ 100 hours of video are uploaded to YouTube every minute
    ▫ 25% of search results are links to user-generated content
    ▫ Companies are integrating their users in customer support
    ▫ Is UGC part of your content strategy?
  • Increasing pressure for speed
    ▫ Real-time chat
    ▫ Instant support
    ▫ Sim ship
    ▫ Do you need to deliver faster this year than 10 years ago?
  • Budgets are being squeezed
    ▫ What does your localization budget for next year look like?
  • Companies are more international than ever before
    ▫ What % of your sales are international?
  • It’s just Common Sense
    ▫ 52% of consumers only buy from a web site in their own language
    ▫ In France and Japan, that figure increased to more than 60%
    ▫ Consumers who do not speak any English are 6 times more likely to avoid English web sites altogether
    ▫ 64% said they would pay more for a product if they could get information about it that they could read
    ▫ “Can’t Read Won’t Buy”
  • MT is about more than reducing localization budgets
    ▫ Where do you want machine translation to take you?
    ▫ Discussion
  • A brief history of machine translation. Or, rather,
  • Machine translation is all about human intervention
    ▫ Training and customization
    ▫ Full post-editing for full publication quality
    ▫ Light post-editing for understandable quality
  • Discussion
    ▫ What do you already know about MT?
    ▫ How do you feel about it?
  • Rules-Based MT (RBMT)
    ▫ Built-in linguistic rules
    ▫ Trained by domain-specific dictionaries
    ▫ Examples: Systran, Lucy, Reverso, ProMT
  • Statistical MT (SMT)
    ▫ Trained by millions of segments
    ▫ Bilingual and monolingual
    ▫ Similar to target content
    ▫ Examples: Moses, Safaba, Bing/Microsoft Translator Hub, Google, Asia Online, SDL Language Weaver/Be Global
  • Online SMT
    ▫ Online engine, available free or on subscription, trained on web content
    ▫ Examples: Bing/Microsoft Translator Hub, Google Translate
  • Hybrid SMT
    ▫ Combines two approaches, usually through multiple passes
    ▫ Examples: Systran, Asia Online
  • Advantages of RBMT
    ▫ Respects grammatical rules: le chat vert
    ▫ User dictionaries control terms
    ▫ System can be improved in near real time
    ▫ Large corpora not necessary
    ▫ Works best for morphologically rich languages
  • Disadvantages of RBMT
    ▫ Customization requires trained linguists
    ▫ Not available for every language
    ▫ Labour intensive
    ▫ Not feasible to add new languages
    ▫ High licensing costs
  • Advantages of SMT
    ▫ Learns automatically
    ▫ Easy to add languages
    ▫ Limited human input needed
    ▫ More fluid sentences
    ▫ Can cover multiple domains
    ▫ Available in open source
  • Disadvantages of SMT
    ▫ Requires large corpora to build
    ▫ Requires strong processing resources
    ▫ Unpredictable
    ▫ Longer time between updating cycles
    ▫ No grammatical rules to govern text it hasn’t seen: le vert chat
  • Advantages of Online SMT
    ▫ Trained on massive data
    ▫ In-domain and out-of-domain
    ▫ Handles colloquial speech and errors well
    ▫ Wide range of languages, even rare ones
    ▫ Low cost; fast implementation
  • Disadvantages of Online SMT
    ▫ Confidentiality (unless there is an agreement)
    ▫ Out-of-domain terminology can muddy results
    ▫ Unpredictable
  • Advantages of Hybrid MT
    ▫ Usually combines the best of both worlds
    ▫ SMT component can be trained on limited corpora
  • Disadvantages of Hybrid MT
    ▫ Requires both linguists and processing resources
    ▫ Not available (yet) in open source
  • Suitable content for MT
    ▫ MT can be a powerful tool for: documentation; procedures; specifications; internal reports; intranets; customer support; user-generated content; intelligence/analytics; e-discovery; patents; alerts, updates, news
    ▫ But not so much for: advertisements; marketing materials; health information; contracts; speeches; literary content
  • Discussion
    ▫ What kind of content do you have?
  • How translation is priced
    ▫ # of words * price per word * # of languages
    ▫ For example: 100,000 words * $0.15 per word * 8 languages = $120,000
  • What is translation memory?
    ▫ Database
    ▫ Stores source segments and translations
    ▫ Used in translation process
    ▫ If a segment has already been translated, you don’t pay for any additional instances as long as no words have changed
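As a rough illustration of the translation memory idea above, the sketch below models a TM as an exact-match lookup and counts only the words in segments that miss. The class and function names are hypothetical, the French target string is an illustrative translation rather than anything from the slides, and real TM tools also handle fuzzy matches, which this sketch ignores.

```python
from typing import Optional


class TranslationMemory:
    """Toy translation memory: exact-match lookup of source segments."""

    def __init__(self) -> None:
        self._segments: dict[str, str] = {}

    def add(self, source: str, target: str) -> None:
        self._segments[source] = target

    def lookup(self, source: str) -> Optional[str]:
        # Exact match only: a single changed word counts as a miss.
        return self._segments.get(source)


def billable_words(segments: list[str], tm: TranslationMemory) -> int:
    """Count the words in segments that have no exact match in the TM."""
    return sum(len(s.split()) for s in segments if tm.lookup(s) is None)


tm = TranslationMemory()
# Illustrative entry; the target text is an assumption, not from the slides.
tm.add("Start date must be before end date.",
       "La date de début doit précéder la date de fin.")

segments = [
    "Start date must be before end date.",            # exact match: not billed
    "Your start date must be before your end date.",  # reworded: billed again
]
print(billable_words(segments, tm))
```

The second segment means the same thing as the first, but because its wording differs it is paid for again, which is why staying close to sentences already in the TM saves money.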
  • To save money – trim words (and stay close to the sentences in your TMs)
    ▫ Use fewer words
    ▫ Our example: Trim 10% of words = 10,000 words
    ▫ New calculation: 90,000 words * $0.15 per word * 8 languages = $108,000
    ▫ Savings: $12,000
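The pricing formula and the trimming example can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures from the slides; the function name `translation_cost` is an assumption made for illustration.

```python
def translation_cost(words: int, price_per_word: float, languages: int) -> float:
    """Cost = number of words * price per word * number of languages."""
    return words * price_per_word * languages


baseline = translation_cost(100_000, 0.15, 8)  # original word count
trimmed = translation_cost(90_000, 0.15, 8)    # after trimming 10% of words
savings = baseline - trimmed

print(f"Baseline: ${baseline:,.0f}")
print(f"Trimmed:  ${trimmed:,.0f}")
print(f"Savings:  ${savings:,.0f}")
```

Because the cost is linear in word count, a 10% cut in words gives a 10% cut in cost, multiplied across every target language.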
  • How machine translation is priced: Training + Post-editing
    ▫ Training cost: included in word price; separate line item; Volume DIY
    ▫ Post-editing cost: # words * price per word
    ▫ Savings from around 150,000 words per language amortize the engine training
  • Training inputs for machine translation
    ▫ Rules-Based (RBMT): ready to use “off the shelf”; customization via user dictionaries (encoded), normalization glossaries and ‘do not translate’ lists; TMs can be used for text extraction
    ▫ Hybrid: all of the above
    ▫ Statistical (SMT): train for language + domain; bilingual corpora (usually millions of segments); TMs; monolingual corpora; glossaries
    ▫ Online SMT: ready to use “off the shelf”; customize with statistical inputs, as above (but less needed)
  • Human translation vs machine translation
    ▫ Human translator, translation: average 2500 words per day
    ▫ Post-editor, full post-editing: average 5000 words per day
    ▫ Post-editor, light post-editing: 20,000 words per day or more
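To make the throughput figures concrete, the sketch below estimates working days for one person to handle the 100,000-word example from the pricing slide. The dictionary and function names are hypothetical, and the light post-editing figure is a floor ("20,000 words per day or more"), so the estimate for it is an upper bound on days.

```python
import math

# Average daily throughput per person, taken from the slide.
WORDS_PER_DAY = {
    "human translation": 2_500,
    "full post-editing": 5_000,
    "light post-editing": 20_000,  # "or more", so days below are a maximum
}


def working_days(words: int, words_per_day: int) -> int:
    """Round up: a partially used day still counts as a day."""
    return math.ceil(words / words_per_day)


estimates = {mode: working_days(100_000, rate)
             for mode, rate in WORDS_PER_DAY.items()}
for mode, days in estimates.items():
    print(f"{mode}: {days} days")
```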
  • Preparing your content for MT
    ▫ A global readiness strategy is even more important for machine translation than for human translation … because computers aren’t as smart
  • A definition
    ▫ Global readiness is the process of creating and optimizing content so that readers all over the world can grasp its meaning and intent. - Scott Abel
    ▫ Most writers in the U.S. do not understand how to create global-ready content
  • Methods of handling global content
    ▫ Translation
    ▫ Localization
    ▫ Transcreation
  • Methods of handling global content
    ▫ Translation: the content stays the same. Language: literal word-for-word translation of everything. Images: no change. Layout: no change. Brand vocabulary: no change.
    ▫ Localization: the meaning stays the same. Language: translate the meaning of the words in a way that is culturally appropriate. Images: change to fit local expectations / product needs. Layout: minimize changes. Brand vocabulary: no change.
    ▫ Transcreation: different content developed to meet business objectives. Language: developed in local language; English may be used as part of the brand vocabulary. Images: change to fit local expectations / product needs. Layout: change to fit local expectations. Brand vocabulary: enhance and expand.
  • Global readiness
    ▫ Focuses on: readability; grammar and style; content reuse
    ▫ Ensures that content is understood by people who have English as a second language
    ▫ Ensures that content is translatable
    ▫ Eliminates culturally insensitive visual elements
  • Translation and global readiness
    ▫ Shorten time to translate
    ▫ Improve translation accuracy
    ▫ Increase success of machine translation
    ▫ Decrease post-edit rework and costs
    ▫ Increase readability for all languages, including English
  • Options?
    ▫ Fix the content before translation
    ▫ Fix it in 15 languages
  • Post-editing as a share of total costs (chart: Initial Project, Subsequent Project)
  • Global readiness 101
    ▫ Reduce word count
    ▫ Increase use of identical sentences
    ▫ Decrease word variability
    ▫ Reduce sentence length and complexity
    ▫ Fix translation-specific word usage errors
  • Writing for MT: specifics
    ▫ Do not leave out necessary words
    ▫ Use articles (the, a, an) and relative pronouns (who, which, that)
    ▫ Use terminology and abbreviations consistently
    ▫ Use the active voice
    ▫ Avoid embedded clauses introduced by commas or dashes
    ▫ Avoid parentheses
    ▫ Avoid the -ing gerund
    ▫ Avoid noun clusters
  • Reduce word count: documents littered with wasteful phrases
    ▫ a great deal of → much (75% word count reduction)
    ▫ due to the fact that → because (80% reduction)
  • Wasteful words everywhere
    ▫ a great deal of → much
    ▫ due to the fact that → because
    ▫ despite the fact that → although
    ▫ arrive at a conclusion → conclude
    ▫ we suggest that you use → use
    ▫ give careful consideration to → consider
    ▫ It can be seen that → (delete)
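A substitution table like the one above lends itself to a simple automated pass. The following is a minimal sketch, assuming plain case-insensitive phrase replacement; the names `WASTEFUL` and `trim_wasteful` are hypothetical, and a production checker would also need to handle capitalization and context.

```python
import re

# Phrase -> replacement, taken from the slide. An empty string means "delete".
WASTEFUL = {
    "a great deal of": "much",
    "due to the fact that": "because",
    "despite the fact that": "although",
    "arrive at a conclusion": "conclude",
    "we suggest that you use": "use",
    "give careful consideration to": "consider",
    "it can be seen that": "",
}


def trim_wasteful(text: str) -> str:
    """Replace each wasteful phrase, then collapse any leftover double spaces."""
    for phrase, replacement in WASTEFUL.items():
        text = re.sub(re.escape(phrase), replacement, text, flags=re.IGNORECASE)
    return " ".join(text.split())


before = "We were late due to the fact that it rained."
after = trim_wasteful(before)
print(after)
print(len(before.split()), "->", len(after.split()), "words")
```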
  • Increase sentence reuse
    ▫ The start date must be before the end date.
    ▫ End date must be greater than or equal to start date.
    ▫ End date must be equal to or later than the start date.
    ▫ End date should be greater than start date.
    ▫ The start date cannot be later than the end date.
    ▫ Start date must be before end date.
    ▫ Your end date must be after your start date.
    ▫ Your start date must be before your end date.
    ▫ The end date must be later than or the same as the start date.
    ▫ The actual end date must be on or after the actual start date.
    ▫ Please enter an end date that is later than or the same as the start date.
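The cost of those variants can be shown by counting distinct segments, since each distinct segment is a separate entry to translate in every language. This is a minimal sketch; the helper names and the choice of approved sentence are assumptions, and an author would still have to decide which meaning is intended, because some variants allow equal dates and others do not.

```python
def normalize(segment: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(segment.lower().split())


def distinct_segments(segments: list[str]) -> int:
    """Number of different segments a translator (or TM) would see."""
    return len({normalize(s) for s in segments})


variants = [
    "The start date must be before the end date.",
    "End date must be greater than or equal to start date.",
    "End date must be equal to or later than the start date.",
    "End date should be greater than start date.",
    "The start date cannot be later than the end date.",
    "Start date must be before end date.",
    "Your end date must be after your start date.",
    "Your start date must be before your end date.",
    "The end date must be later than or the same as the start date.",
    "The actual end date must be on or after the actual start date.",
    "Please enter an end date that is later than or the same as the start date.",
]

# Hypothetical editorial decision: one approved sentence replaces every variant.
APPROVED = "The start date must be before the end date."
standardized = [APPROVED for _ in variants]

print(distinct_segments(variants), "segments before standardizing")
print(distinct_segments(standardized), "segment after standardizing")
```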
  • Decrease word variability
    ▫ “The weather outside is hot.” (But not to a team of writers…)
  • Decrease word variability: The weather outside is...
    ▫ baking, blazing, blistering, boiling, burning, febrile, feverish, fiery, flaming, heated, hot, humid, like an oven, on fire, ovenlike, parching, roasting, scalding, scorching, searing, sizzling, smoking, steaming, sultry, sweltering, torrid, tropical, very warm
  • Word usage errors
    ▫ Slang, idioms, ambiguous words: right vs. correct; “My word!”
    ▫ Modal verbs
    ▫ Split infinitives: “to boldly go”
    ▫ Verb tenses: Dogs will always chase cats vs. Dogs always chase cats
    ▫ Pronouns create gender issues: It was hot (what is hot? Male? Female?); This creates a problem (what is this?)
  • For emaxlpe, it deson’t mttaer in waht oredr the ltteers in a wrod aepapr, the olny iprmoatnt tihng is taht the frist and lsat ltteer are in the rghit pcale. The rset can be a toatl mses and you can sitll raed it wouthit pobelrm. S1M1L4RLY, Y0UR M1ND 15 R34D1NG 7H15 4U70M471C4LLY W17H0U7 3V3N 7H1NK1NG 4B0U7 17.
  • Example: Limit the length of sentences
    ▫ Original text: Alerts are displayed on alert list windows, which provide tools and information to aid users as they determine whether alerts represent suspicious activity that should be reported to authorities.
    ▫ MT output: Des alertes sont montrées sur les fenêtres de listes des alertes, qui fournissent des outils et des informations aux utilisateurs d'aide pendant qu'elles déterminent si les alertes représentent l'activité suspecte qui devrait être rapportée aux autorités.
    ▫ Edited text: Alerts are displayed in alert list windows. The alert list windows provide tools and information that help users determine whether alerts indicate suspicious activity that should be reported to authorities.
    ▫ MT output: Des alertes sont montrées dans des fenêtres de listes des alertes. Les fenêtres de listes des alertes fournissent les outils et les informations qui aident des utilisateurs à déterminer si les alertes indiquent l'activité suspecte qui devrait être rapportée aux autorités.
    ▫ After PE: Les alertes s’affichent dans des fenêtres de listes des alertes. Les fenêtres de listes des alertes fournissent les outils et les informations qui aident des utilisateurs à déterminer si les alertes indiquent une activité suspecte qui devrait être signalée aux autorités.
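A sentence-length rule is easy to check mechanically. The sketch below flags sentences over the 25-word limit given in the speaker notes' rule list, using the original and edited text from this example. The function names are hypothetical and the sentence splitter is deliberately naive (it would mis-split on abbreviations such as "e.g."), so treat it as an illustration rather than a replacement for a tool like Acrolinx.

```python
import re

MAX_WORDS = 25  # limit taken from the rule list in the speaker notes


def split_sentences(text: str) -> list[str]:
    """Naive split: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def long_sentences(text: str, max_words: int = MAX_WORDS) -> list[tuple[int, str]]:
    """Return (word count, sentence) for each sentence over the limit."""
    flagged = []
    for sentence in split_sentences(text):
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


original = (
    "Alerts are displayed on alert list windows, which provide tools and "
    "information to aid users as they determine whether alerts represent "
    "suspicious activity that should be reported to authorities."
)
edited = (
    "Alerts are displayed in alert list windows. The alert list windows "
    "provide tools and information that help users determine whether alerts "
    "indicate suspicious activity that should be reported to authorities."
)

print([count for count, _ in long_sentences(original)])
print([count for count, _ in long_sentences(edited)])
```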
  • Example: Avoid the gerund -ing
    ▫ Original text: Understanding the differences between owned and checked out alerts is critical to understanding SAS Anti-Money Laundering,
    ▫ MT output: La compréhension des différences entre les alertes possédées et Extraites est critique au SAS Anti-Money Laundering de compréhension,
    ▫ Edited text: In order to understand SAS Anti-Money Laundering, you need to understand the differences between owned alerts and checked out alerts.
    ▫ MT output: Afin de comprendre le SAS Anti-Money Laundering, vous devez comprendre les différences entre les alertes détenues par un autre utilisateur et les alertes bloquées.
    ▫ After PE: Afin de comprendre le fonctionnement de SAS Anti-Money Laundering, vous devez comprendre les différences entre les alertes détenues par un autre utilisateur et les alertes bloquées.
  • Example: Move prepositional phrase closer to what it modifies
    ▫ Original text: Click Check Out in that alert's Availability column on the Available Alerts window.
    ▫ MT output: Le clic Extraient dans la colonne Disponibilité de cette alerte sur la fenêtre Alertes disponibles.
    ▫ Edited text: In the Available Alerts window, click Check Out in the alert's Availability column.
    ▫ MT output: Dans la fenêtre Alertes disponibles, le clic Extraient dans la colonne Disponibilité de l'alerte.
    ▫ After PE: Dans la fenêtre Alertes disponibles, cliquez sur Extraire dans la colonne Disponibilité de l'alerte.
  • Example: Limit the use of passive voice
    ▫ Original text: Risk-factor-only alerts can be identified by the contents of the Scenario and Triggering Values columns on an alert list window.
    ▫ MT output: Des alertes de type facteur de risque uniquement peuvent être identifiées par le contenu du scénario et des colonnes Valeurs de déclenchement sur une fenêtre de listes des alertes.
    ▫ Edited text: For a risk-factor-only alert, the Scenario column of the alert list window displays either ML_Risk or TF_Risk.
    ▫ MT output: Pour une alerte de type facteur de risque uniquement, la colonne Scénario de la fenêtre de listes des alertes montre ML_Risk ou TF_Risk.
    ▫ After PE: Pour une alerte de type facteur de risque uniquement, la colonne Scénario de la fenêtre de listes des alertes indique ML_Risk ou TF_Risk.
  • BTW, post-editors really hate bad MT
  • Case study: How Does Pre-Editing Affect the Quality of Machine Translated Content?
  • Hypothesis
    ▫ Pre-editing source content specifically for machine translation will improve the quality of MT output and lessen the need for post-editing.
  • Scenario
    ▫ Started with 10,751 words in English
    ▫ MT engines previously trained for English into French
    ▫ Checked English source content using Acrolinx – collected metrics
    ▫ Corrected Acrolinx edits – collected metrics
    ▫ Edited English content second time – collected metrics
    ▫ Ran edited content through MT engine – collected metrics on post-editing productivity, error typology and quality score
  • English metrics pre/post source editing (chart: Baseline, Acrolinx-only, Re-edit)
  • Editing details
                       Score  Spelling  Grammar  Style  Words  Sentences
    Before               348       116       12    246  10751        975
    Acrolinx-only        151       109        8     45  10708       1025
    Acrolinx + edit      140       101        5     45  10789       1073
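One figure that can be derived from the editing details is the average sentence length at each stage, which shows the effect of splitting long sentences. This is a small sketch using only the Words and Sentences columns from the slide; the variable and function names are hypothetical.

```python
# (words, sentences) per editing stage, taken from the slide's table.
EDITING_DETAILS = {
    "Before": (10751, 975),
    "Acrolinx-only": (10708, 1025),
    "Acrolinx + edit": (10789, 1073),
}


def avg_sentence_length(words: int, sentences: int) -> float:
    """Average number of words per sentence."""
    return words / sentences


averages = {
    stage: round(avg_sentence_length(words, sentences), 2)
    for stage, (words, sentences) in EDITING_DETAILS.items()
}
for stage, avg in averages.items():
    print(f"{stage}: {avg} words per sentence")
```

The word count barely changes across the three stages while the sentence count rises, so the average sentence gets shorter at each step.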
  • Most common edits Acrolinx flags  Long sentences  Use this/that/these/those with a noun  Modal verbs Additional edits  Remove „-ing‟ words  Eliminate possessives  Eliminate colloquialisms (eg., “lay of the land”) @ContentRulesInc @LoriThicke © 2014. Content Rules, Inc., LexWorks All rights reserved.
  • Methodology: To avoid introducing bias at the post-editing step, new source files were made by mixing the 3 kinds of source text (no pre-editing, automatic pre-editing, human pre-editing); SMT & RBMT engines processed the source files; 3 post-editors were assigned, 1 per reconstituted file; post-editing was carried out within a CAT tool with no TMs or glossaries. 3 quality measures were used: (1) automatic metrics: GTM (General Text Matcher) using SymEval (http://sourceforge.net/p/symeval/wiki/Documentation/); (2) post-editing productivity (words per hour); (3) linguistic evaluation: categorization of errors made by the engine.
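To make the first two quality measures concrete, here is a minimal Python sketch. It is not the SymEval implementation: GTM, as commonly described, is an F-measure over word matches between MT output and a reference that can also reward longer contiguous runs, and this sketch assumes the simplest case, where only individual word matches count. The function names `unigram_f_measure` and `words_per_hour` are hypothetical.

```python
from collections import Counter


def unigram_f_measure(candidate, reference):
    """Simplified GTM-style score: F-measure over matched words.

    Assumption: only single-word matches count; contiguous runs are
    not rewarded, so real GTM/SymEval scores may differ.
    """
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each word matches at most as often as it
    # occurs in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def words_per_hour(words_post_edited, minutes_spent):
    """Post-editing productivity in words per hour."""
    if minutes_spent <= 0:
        raise ValueError("minutes_spent must be positive")
    return words_post_edited * 60 / minutes_spent


if __name__ == "__main__":
    mt_output = "la colonne montre ML_Risk"
    post_edited = "la colonne indique ML_Risk"
    print(unigram_f_measure(mt_output, post_edited))
    print(words_per_hour(500, 30))
```

In a setup like this, the post-edited text would serve as the reference, so a higher score means the post-editor had to change less of the raw MT output.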
  • RBMT results (improvement in GTM quality score)
      Pre-editing type                  RBMT GTM score  Diff. with no pre-editing
      No pre-editing                             53.90
      Automatic pre-editing                      56.78  +2.88
      Auto. pre-editing + human edit             56.89  +2.99
    (Bar chart of the RBMT GTM score for the same three pre-editing types.)
  • SMT results (no improvement in GTM)
      Pre-editing type  GTM score  Diff. with no pre-editing
      PE0                   54.30
      PE1                   54.19  -0.11
      PE2                   51.50  -2.80
    (Bar chart of the SMT engine GTM scores for PE0, PE1 and PE2.)
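The "difference with no pre-editing" figures on the two results slides can be reproduced directly from the GTM scores. The sketch below assumes that PE0, PE1 and PE2 on the SMT slide correspond to no pre-editing, automatic pre-editing, and automatic pre-editing plus human edit on the RBMT slide; the slides do not say so explicitly. `GTM_SCORES` and `deltas_vs_baseline` are hypothetical names.

```python
# GTM scores transcribed from the RBMT and SMT results slides.
# Assumption: PE0/PE1/PE2 = no / automatic / automatic + human pre-editing.
GTM_SCORES = {
    "RBMT": {"PE0": 53.9, "PE1": 56.78, "PE2": 56.89},
    "SMT": {"PE0": 54.3, "PE1": 54.19, "PE2": 51.5},
}


def deltas_vs_baseline(scores, baseline="PE0"):
    """Difference of each pre-editing condition against the baseline."""
    base = scores[baseline]
    return {
        condition: round(score - base, 2)
        for condition, score in scores.items()
        if condition != baseline
    }


if __name__ == "__main__":
    for engine, scores in GTM_SCORES.items():
        print(engine, deltas_vs_baseline(scores))
```

The signs tell the story of the study: both RBMT deltas are positive, while both SMT deltas are negative.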
  • Conclusions: Automatic pre-editing shows improvement in GTM score and post-editing productivity for RBMT; on this type of source text, automatic pre-editing improved GTM scores by 4 points and productivity by 14% (with other source texts we've seen 30% increases); further pre-editing (adding human effort) did not result in significant improvements in the indicators; no impact, or even a degradation of results, was seen with SMT.
  • Analysis: Pre-editing boosts quality and post-editing productivity for rules-based and hybrid systems because they parse sentences for meaning, and pre-editing makes the sentences clearer; there is little impact on SMT because its training is based on content that was not pre-edited.
  • To have an even bigger impact: Re-use as much as you can from your translation memories, which will form the basis of training your MT engine (and clean them up from time to time); author with Global Readiness 101 guidelines; stay consistent with your terminology; use the MT approach that best suits your content; establish pricing with your vendors that incorporates productivity improvements over time.
  • Discussion: What resources/rules do you give your writers? Style sheets? Terminology lists? Tools?
  • What MT approach best suits my use case? Rules-Based, Statistical, Hybrid. How to determine which approach suits which content, context and language.
  • Content types (matrix of content type and other considerations against MT approach). Columns: Online SMT, RBMT, Hybrid, SMT. Rows: documentation, reports, online help, UI; FAQs, forums, UGC; patents, other broad domain; marketing materials; insufficient in-domain/out-of-domain data; poor grammar, spelling.
  • Language combinations (sample matrix of language considerations against MT approach). Columns: Online SMT, Hybrid, RBMT, SMT. Rows: French, Spanish, Italian; Russian, Japanese, German; Norwegian, Danish, Thai.
  • Features (RBMT vs. SMT). Capability: number of languages handled out of the box (~20 and ~50 across the RBMT and SMT columns); ability to add language pairs (if training data available). Cost: free or open source version exists (e.g. Apertium, Moses); SaaS (Software-as-a-Service) models exist. Training: learns automatically; rapid improvement cycle; can be trained by engineers; can be trained by linguists; effective with limited training data.
  • Features (RBMT vs. SMT), continued. Quality: output is fluent; output is predictable; pre-editing significantly improves output quality; can handle poor grammar or spelling; uses specified terminology applying correct grammar; handles software tags without special programming; can be integrated with TM (Translation Memory) tools.
  • Features (RBMT vs. SMT), continued. Suitability: better performance with online chat; better performance with UGC and broad-domain content (e.g. patents); better performance for documentation/UI and other narrow-domain content; suited to rare language pairs (where training data is available); suited to full post-editing/real-time improvement cycles.
  • Case study: Alstom Transport Factory, Novocherkassk, Russia. Challenge: English to Russian; 2 pages to 100 pages daily; over 3 years; technical specifications, contracts; no bilingual data at project start. Technology used: RBMT.
  • Case study: eDiscovery, looking for that smoking gun. Challenge: Japanese to English; 30,000 pages, mostly emails, technical reports, meeting minutes; poor grammar, colloquial. Technology used: Hybrid.
  • Case study: SNCF, response to technical RFP. Challenge: French & English to Brazilian Portuguese; 3,400 pages, lots of different documents; multiple passes on the same file; tender response; limited data at project start. Technology used: Hybrid.
  • Case study: Customer support online forums and FAQs. Challenge: 9 languages; dynamic content; poor grammar, colloquial; solve problems before they reach the help desk; need 24/7 availability. Technology used: Online SMT.
  • Case study: BNP's self-service MT server. Challenge: world's 3rd largest bank; centralized MT server for 200,000 employees; multiple business units with their own terminology; self-service; behind the firewall (we train & update remotely). Technology used: Hybrid.
  • Raw or post-edited: How do you like your MT? Fully post-edited to human quality: documentation, reports, etc. Lightly post-edited: news items, intranet, alerts. Raw or lightly post-edited: intelligence/analytics, eDiscovery, customer support.
  • How good is good enough? What REALLY matters to customers in support content. Most important: technically accurate; problem-solution description; causes of problem; complete information; quickly found; clarity of content; valid hyperlinks; product and version; configuration information. Least important: legal disclosures; date last used; frequency of usage; punctuation; grammar; complete sentences; correct spelling. (From the Consortium for Service Innovation.)
  • Some customer support stats: Customers report greater satisfaction with machine translation vs. no translation. Microsoft: 23% found machine-translated articles to be useful vs. 29% for human translation. Intel: raw MT content was only 3% less successful in answering customer questions (44% vs. 47%).
  • MT is nearly as useful as a human translation, and sometimes even more so… (Chart; surviving label: HT (English) 64%.) Source: Martine Smets, Microsoft Customer Support.
  • Benefits of MT for customer support. Speed: allows for just-in-time translation to speed information to customers; 30% (post-edited) to 100% (raw) faster. Cost: "sufficient to solve" for a fraction of the cost of human translation (30% to 90% cheaper); reduces calls to the help desk ("call avoidance") by 20% to 50%.
  • Benefits of MT for customer support (cont.). Supports customer engagement strategies: self-service, empowers the customer; increases the information customers can access (from ≃10% to 100%); increases customer satisfaction ("Did this solve your problem?"); expands markets, since the low cost of MT enables serving more markets; enables many-to-many. Is international customer support an issue for you?
  • Implementing MT: Getting executive buy-in. The problem: executives, in general, lack a basic understanding of the value of local language; there is a perception that localization is just operational expenditure, so success is defined only by cost savings; data on value (revenue enablement, customer satisfaction) is difficult to obtain, and the value is even harder to prove.
  • Implementing MT: Getting executive buy-in (cont.). Change the narrative. Competitive advantage: building tighter customer relationships, speaking to customers in their language, building communities/social media buzz. ROI: customer engagement, satisfaction, scalability in new markets, no lost market opportunities.
  • Implementing MT: Resources. Expertise; linguists; post-editors; bilingual and monolingual corpora; engineers; processing power. What resources do you have access to?
  • Implementing MT: Insource or outsource?
  • Questionnaire: Insource or outsource? The impact of MT on your business strategy is… (A) useful, (B) vital. The in-house cost for MT compared to the marketplace is… (A) high, (B) low. Ideally, your MT processes would be… (A) discrete, (B) integrated. Your technological maturity is… (A) low, (B) high. Your in-house MT capability compared to the marketplace is… (A) low, (B) high.
  • Customer view of MT: (A) customers are concerned with the outcome, not the process; (B) customers are concerned with the process of MT. Capabilities to execute MT: (A) capabilities and assets are available in the market from qualified providers; (B) MT requires specialized capabilities and assets that are not easily found outside the company. Technological requirements: (A) the technology is either very stable with limited applications, or very dynamic and changing faster than you can adapt; (B) the technology is relatively fluid, and possessing it can be a clear advantage.
  • World-class ability: (A) resources to achieve world-class performance are not available; (B) resources and capabilities exist to retain/achieve world-class performance. Capability vs. alternative: (A) external vendors are clearly more competent; (B) a leadership position exists within your company. Effort to close gaps: (A) significant capital and resources are required to close gaps; (B) the internal source provides a clearly competitive cost advantage over external suppliers, and the rate of improvement is high. Length of commitment: (A) you plan to harvest or exit the business in the near future; (B) a long-term planning horizon exists.
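One way to use the questionnaire above is simply to tally the answers. The slides do not state a scoring rule, so the sketch below rests on an assumption: the A answers (MT is merely useful, in-house cost is high, external vendors are more competent) are read as leaning toward outsourcing, and the B answers as leaning toward insourcing. The function name `tally_answers` and the leaning labels are hypothetical.

```python
from collections import Counter


def tally_answers(answers):
    """Count A and B answers and report which way they lean.

    Assumption (not stated on the slides): mostly A suggests
    outsourcing, mostly B suggests insourcing.
    """
    normalized = [answer.strip().upper() for answer in answers]
    invalid = [answer for answer in normalized if answer not in ("A", "B")]
    if invalid:
        raise ValueError(f"answers must be 'A' or 'B', got: {invalid}")
    counts = Counter(normalized)
    a_count, b_count = counts["A"], counts["B"]
    if a_count > b_count:
        leaning = "outsource"
    elif b_count > a_count:
        leaning = "insource"
    else:
        leaning = "no clear leaning"
    return {"A": a_count, "B": b_count, "leaning": leaning}


if __name__ == "__main__":
    # Twelve questions across the three questionnaire slides.
    example = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "B", "A", "B"]
    print(tally_answers(example))
```

A plain majority count treats every question as equally important; in practice a team might weight strategic questions, such as the impact of MT on business strategy, more heavily than the others.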
  • Discussion: Do you have executive buy-in? Do you need it? What are the ways to get buy-in from all stakeholders?
  • MT quality IS possible… “Contrary to all expectations, using MT in Bentley has improved the translation quality in the pilot projects.” French OLH reviewer: “I give a 9…I find this translation very good…I found it better than the translations I used to see before.” German courseware reviewer: “It was the best translation of courseware I ever read.”
  • Questions
  • Thank you! Val Swisher, Founder & CEO, Content Rules, Inc. Lori Thicke, Founder & CEO, LexWorks.