Why MT Matters


LUSPIO presentation at LTAC Conference in Rome April 2011

Published in: Technology, Business

  1. 1. Why Machine Translation Matters: Trends & Best Practices. Kirti Vashee (http://kv-emptypages.blogspot.com). Copyright © 2009, Asia Online Pte Ltd
  2. 2. • A Content Explosion Across The Globe
        • The Emergence of Social Media and Social Networking as Business Drivers and Influencers
        • New Open Innovation & Collaboration Business Models
        • The Increasing Importance of Technology & Automation
        • A Rising Asian Market
        • Changing Global Enterprise Priorities
  3. 3. More information was created in 2005 than in the previous 40,000 years! 2009 = 800,000 petabytes (1 PB = 1,000,000 GB). [Chart: Total Exabytes of Information by year, 2009 to 2020; vertical axis 0 to 40,000 exabytes.] Source: IDC Digital Universe Study, May 2010
  4. 4. More content will be translated than ever before
        • By 2012, enterprises will be processing and managing 15 times more data than in 2007.
        • Each year the amount of information created in the enterprise, paper and digital combined, grows by more than 65%. (IDC)
        • In 2012, there will be 5X as many bits created and added to the Digital Universe as in 2008.
        • The Digital Universe will double every 12 to 18 months.
  5. 5. The Impact of User Generated Content. 70% of Digital Universe is UGC. User Creation; Enterprise Worries
        • Growing influence of social networking and social media
        • Users share opinions about enterprise products, services and companies
        • Users trust other user opinions more than they trust corporate marketing messages
        • Word of Mouth Marketing (WOMM) is now an important element of marketing strategy
        • Huge impact on buying behavior
        • Twitter as Customer Support
        • Dynamic and Uncontrolled
        [Venn diagram: User Generated Content (consumers and workers creating, capturing or replicating personal information) and Touch Content (transported, hosted, managed or secured); circle sizes ~900 and ~960 exabytes, overlap ~600 exabytes. Size of Digital Universe in 2010: 1,200 exabytes.]
  6. 6. "As these conversations become increasingly independent of these sites, falling traffic will render them ineffective in their current form. Instead, the online presence of each brand will necessarily expand out into the social space to stay in touch with their audience." (Simon Mainwaring)
  7. 7. [Diagram labels: Customer Exceptions at X, 10X and 30X volume; Assisted (Support Center); Self-Service (Web Portal, Knowledge Base); Communities (User Initiated Groups, Community Conversations); Development/Engineering; Product Management.] Source: Consortium for Service Innovation
  8. 8. [Chart: share of customer interactions by type (community activity, self-service activity, assisted new case activity) for three periods: Jul-Sep 07 (FY08Q2), 2,895,302 customer interactions; Jul-Sep 08 (FY09Q2), 6,609,817 customer interactions; Apr-Jun 09 (FY10Q1), 8,002,883 customer interactions. Percentages shown: 2%, 2%, 3%, 5%, 27%, 37%, 60%, 93%, 71%.] Companies listed: Cisco, Hewlett-Packard, Microsoft, Oracle, Symantec, Yahoo!, Dell, Apple, Intuit, Mentor Graphics, Novell, VeriSign, RIM, Alcatel, BMC, Deutsche Bank. Source: Consortium for Service Innovation
  9. 9. Customer Interactions vs. Corporate Investment and Focus
        • Assisted Support Activity: 10,000 @ $250/case
        • Self-Service Support Activity: 100,000 @ $10/exception
        • Community Support Activity: 300,000 @ $1/exception?
        [The diagram also labels Direct Support Activity and Indirect Support Activity, with percentages 95%, 1-3%, 5-9% and 90-95%.] Source: Consortium for Service Innovation
  10. 10. Evolution from the G7 to the G20 World. Fast growing Asian economies and BRICI offer the fastest growing global market opportunities and could reduce and supersede FIGS dominance in future.
          Top Ten Languages (by users) in the Internet, in millions of users: English 478, Chinese 384, Spanish 137, Japanese 96, French 79, Portuguese 73, German 65, Arabic 50, Russian 45, Korean 37, All the Rest 290.
          • McKinsey: 700+ million new Asian users will come online over the next 5 years and represent a $80B+ market for infrastructure & commerce
          • China 770M, India 350M users in 5 years!
          • Fastest growing languages on the Internet: ZH, AR, RU, HI, ID, BrPt, MY, PH & Indic languages
          • BCG: BRICI will have 1.2+ billion online by 2015
          • Cisco Study: Most growth in the Internet-related market will occur outside of today's high income, or "advanced," economies
          • Fastest growing digital consumer populations will be in Asia and Brazil
          42% of all Internet users in 2009 were Asian. Forecast to grow to nearly 60% by 2015.
          LabBrand: China is the biggest luxury market opportunity in a generation.
          McKinsey: China is on track to pass the United States as the home of the world's largest R&D workforce.
  11. 11. • Global enterprises face a content deluge with dynamic content coming from both internal and external sources
          • High volumes of content expected to be translated faster and faster
          • Customers increasingly in control of marketing and brand messages
          • A shift from corporate messaging to customer conversations and authentic communications
          More Content, Faster Turnaround Times, Lower Cost
  12. 12. "Now, more than at any other time in history, speed and agility are decisive competitive advantages..." (David Meerman Scott)
          "In revolution, the best of the new is incompatible with the best of the old. It's about doing things a whole new way…" (Clay Shirky)
  13. 13. • What We Translate: More Dynamic Real-Time Content
          • Why We Translate: From Mandatory to Increase and Expand Communication with Customers
          • How We Translate: More Automation, MT and Open Collaboration Models
          • Highly Personalized Content to Customers when they need it in a variety of digital forms
          More Content, Continuous, Faster Turnaround, Cheaper. Project Based TEP to Continuous Streams
  14. 14. Low Volume, Static Content
          [Diagram labels: Project Management; Cost Minimization; TEP Production Modes; Focus on Formatting; Corp Product Sheets; Product Packaging; Basic Marketing; Web / User Interface (GUI); Basic Web Content; User Documentation.]
          • The Target Customer: Localization Departments, Marketing Support
          • Production Model: TEP (Translate > Edit > Proof)
          • Key Technologies: Translation Memory, TMS, Email (Trados, déjà vu, Wordfast, TMS, Idiom, MS Office)
          • Key Objectives: SimShip, Customer Quality Acceptance, Formatting
          • Content Volatility: Relatively Static, Linked to Product Updates
          • Integration with Customer Systems: Little if ever (CMS)
  15. 15. [Three columns, headings not recoverable.]
          Column 1:
          • Static Reference Material
          • Long Shelf Life
          • Just In Case
          • Mandatory and necessary
          • Information flow from company to consumer
          Column 2:
          • Real Time Search & Find Mode
          • Information acquired as needed
          • Comprehensive & dynamic knowledge base
          • Continuously Updated
          Column 3:
          • Human Filtered Information
          • Expert Identification
          • Trust agent based information gathering
          • Continuously flowing and changing
  16. 16. [Diagram labels: Interactive Support (Email, Instant Messaging, Voice); Knowledge Base Data; User Manuals; User Generated Support Content; Blogs; Documentation.]
          • Web 2.0 is much more interactive and dynamic
          • Unstructured content in blogs, social networks is critical
          • Community engagement and collaboration is key
          Dynamic & Continuously Flowing Content
  18. 18. Content types, examples and word counts, on a scale running from Human to Machine translation
          Existing Focus:
          • Corporate (Brochures): 2,000 words
          • Corporate Products (Product Brochures): 10,000 words
          • User Interface (Software Products): 50,000 words
          • User Documentation (Manuals / Online Help): 200,000 words
          New Markets:
          • Enterprise Information (HR / Training / Reports): 500,000 words
          • Communications (Email / IM): 10,000,000 words
          • Support / Knowledge Base (Call Center / Help Desk): 20,000,000+ words
          • User Generated Content (Blogs / Reviews): 50,000,000+ words
          Problem: Only 0.5% of what needs to be translated today is being translated due to cost and time constraints. The TEP process is slow and expensive.
          Solution: Machine translation offers a potential boost that could produce "good enough" quality for many applications.
  19. 19. General Purpose:
          • Goal is to get a general understanding
          • Generic systems that are built from public domain data
          • Basic quality translation but intended for wide applicability
          • Focus = Broad but shallow
          • Google, Babelfish, MSN Live and other free sites
          • Quality is only for gisting and general understanding
          • One size fits all
          • Loss of ownership
          • Privacy
          • Limits to volume
          • Black Box
          Customized:
          • Goal is to produce near-human quality
          • Tuned for the language style and domain of a single customer
          • Built with customer data
          • Much higher accuracy and translation quality
          • Focus = Narrow but deep
          • Optimized for a specific customer defined domain
          • Matched to a specific purpose
          • Quality can be publication ready
          • Secure data, private system
          • No volume limits
          • Complete Control and Openness
  20. 20. [Process diagram labels: Data Preparation; Data Cleaning; Training; Translate; Diagnostic & Fine Tuning; Quality Assurance; Combined Data Collections; Language Pair Foundation Data; Domain Foundation Data; Original Translation Sources; Client Custom Domain Data.]
          • Near-human translation quality is possible by combining:
          • Asia Online's Language Pair Foundation Data (516 language pairs to choose from)
          • Domain Foundation Data (15 domains per language pair) with data from the client
  21. 21. [Diagram: starting from the Initial System, human feedback supplies targeted corrections for learning, covering mistranslation of terminology, bad syntax/grammar, and spelling and punctuation.] Human feedback can raise the raw output to previously unseen quality levels.
  22. 22. • Linguistic Steering: Pattern Identification, Corpus Analysis, Linguistic Problem Solver, Quality Assessment, Linguistic Asset Development and Test & Tuning Set Development
          • MT-Savvy Translators & Editors: Rapid Error Identification / Correction; Manufacture Corrective Data and Drive Early Development of MT Engines
          • Less Skilled Editors to Correct Target Language Content: Can be Monolingual, Students, Housewives
          • Monolingual Data Cleanup: N-gram Resolution and Preparation
  23. 23. • Corpus Analysis & Preparation
          • Pattern Identification
          • Linguistic Structural Analysis
          • Linguistic Problem Solving
          • Linguistic Production Process Management
          • Translation & MT Engine Quality Assessment
          • Rapid Quality Assessment
          • Effective Use and Development of Automated Measurements
          • Steering Guidance to MT Developers
          • Rapid Error Detection & Correction
          • Open minded translators
          • Better translator workbenches and tools
          • Skilled monolinguals with subject matter expertise (SME)
          • Community Management
          • Recruiting
          • Quality Management
  24. 24. • Initial system put into production
          • Trained internal experts begin initial clean up and correction process
          • Changes are collected and added to the initial corpus to drive continuous retraining
          • All users allowed to suggest changes, which go through a vetting process
          • Expert users also allowed to make changes
          [Chart: Quality against Engine Learning Iteration (1 to 6), showing Raw MT Quality, Post-Editing Effort and the Publication Quality Target.]
          Post-editing effort and cost can be managed by improving the quality and performance of the MT engine via corrective linguistic feedback.
  25. 25. [Diagram labels: Sales / Marketing; Product Management; Content Management; Customer Support; Blogs; CRM; Biz Intelligence; TMS; ECM; BPM; Email; IM; The Global Customer; Continuous Improvement SMT Hybrid Engines.]
          • Continuous Evolution Translation Systems
          • Integration with content creation and content management tools
          • Better standards to facilitate flow and data interchange
          • Tighter integration into corporate business systems
  26. 26. Content Type: Target Quality; Process; Volumes
          • Legal, Marketing, Mandatory: High; Human Translation, TEP; Low
          • Reference, KB: Moderate; Custom MT + Professional Post-Editing; High
          • User Generated Content: Moderate to Low; Custom MT + Community Post-Editing; Very High
          • Random Corporate Content: Low (Gisting); Custom Corporate MT; High
          • Random Web Content: Low (Gisting); Free MT; 150 Billion Words in Google Translate in 2010
          Match the production process to the value, volume and quality configuration of the content.
  27. 27. Internal Corporate Content:
          • Product Training Materials
          • Manuals & Documentation
          • Design & Research
          • Sales & Marketing Content
          • Emails & Website
          • Training Materials
          External Partner & Customer Content:
          • Customer Feedback
          • Customer Care & Support
          • Customer Blogs & Forums
          • Social Network Content
          Steps: Prioritize for Translation Process; Develop Linguistic Profiles of Key Content; Build and Leverage Linguistic Assets for Translation Production Lines; Different Target Quality: TEP, MT + Post Editing, Custom MT, Raw Corporate Baseline MT.
          [Cycle diagram labels: Customize MT Engine; Communicate; Translate & Post-Edit; Simplify & Clean Source; Refine MT Engine; Listen & Learn; Distribute & Share; Analyze; Correct.]
  28. 28. The Consumer Market (SE Asian Languages; Large Corporate Buyer & Publisher Perspective): Revolutionize the Internet experience for non-English speakers in Asia. Provide 1 billion+ local-language pages online using mostly translated open license content, combined with compelling portal and social networking style services in Thailand, Indonesia, India, Malaysia, Philippines, Vietnam and China, Japan & Korea.
          The Enterprise Market (PFIGS - CJK; Translation Tools Vendor Perspective): Revolutionize the enterprise translation process with a comprehensive, continuous learning SMT platform. A SaaS environment that allows data cleaning and preparation, develops SMT engines on demand and enables ongoing comprehensive post-editing and correction to continuously improve engines.
  29. 29. [Bubble chart: Scientists & Engineers per Million People (0 to 6,000) against R&D as a Percentage of GDP (0 to 4.5). Size of circle reflects relative amount of annual R&D spending by country noted. Key: Asia, Europe, Americas, Others. Countries plotted: Finland, Denmark, Japan, Sweden, Norway, USA, Canada, Taiwan, Singapore, Australia, Switzerland, Russia, Belgium, Germany, Korea, Ireland, Austria, France, Spain, Netherlands, United Kingdom, Portugal, Hungary, Poland, Italy, Israel, South Africa, Turkey, Malaysia, China, Mexico, Brazil, Thailand, India.] Source: R&D Magazine (Reed), Battelle, World Bank, OECD, K4D, UNESCO, Strat-etech Consulting
  30. 30. English mock-up of the Thai Wikipedia project, which was launched in January 2011 with funding support from the Thai Ministry of ICT. It is already the 4th busiest site in Thailand and should be the top site by the end of the year.
  35. 35. • What content has the greatest value for our target audience?
          • How do we get it translated quickly, at the highest quality possible at the lowest cost possible?
          • How do we build infrastructure that enables emerging new content to be quickly translated as needed?
          • From localization projects to flowing streams of high value customer related content
  36. 36. Thank you for your attention. Kirti Vashee. Follow me on Twitter: @kvashee. Join the Automated Language Translation Group in LinkedIn
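
The growth figures on slide 4 can be cross-checked with a little arithmetic. The sketch below assumes constant compound yearly growth, which the slide does not state; the function names are hypothetical and chosen only for illustration.

```python
import math


def annual_growth_from_multiple(multiple: float, years: float) -> float:
    """Constant yearly growth rate that yields `multiple` after `years` years."""
    return multiple ** (1.0 / years) - 1.0


def doubling_time_months(annual_growth: float) -> float:
    """Months needed to double at a constant yearly growth rate."""
    return 12.0 * math.log(2.0) / math.log(1.0 + annual_growth)


# Slide 4: 15 times more data in 2012 than in 2007, i.e. over 5 years.
implied_rate = annual_growth_from_multiple(15.0, 5.0)

# Slide 4: enterprise information grows by more than 65% per year.
months_to_double = doubling_time_months(0.65)

print(f"implied yearly growth: {implied_rate:.0%}")
print(f"doubling time at 65% per year: {months_to_double:.1f} months")
```

Under that assumption, the 15-fold figure implies growth of roughly 72% per year, and 65% yearly growth doubles the total in between 16 and 17 months, so the slide's separate claims are broadly consistent with each other.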
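
The support volumes and unit costs on slide 9 can be multiplied out to compare what each channel costs in total. This is a minimal sketch: the figures are the slide's, while the dictionary layout and the function name are assumptions made for illustration.

```python
# Volumes and unit costs as listed on slide 9 (Consortium for Service Innovation).
# The keys and the data layout are hypothetical, not from the presentation.
SUPPORT_CHANNELS = {
    "assisted": (10_000, 250.0),      # 10,000 cases @ $250/case
    "self_service": (100_000, 10.0),  # 100,000 exceptions @ $10/exception
    "community": (300_000, 1.0),      # 300,000 @ $1/exception (marked "?" on the slide)
}


def channel_costs(channels):
    """Return the total cost per channel as volume * unit cost."""
    return {name: volume * unit_cost for name, (volume, unit_cost) in channels.items()}


costs = channel_costs(SUPPORT_CHANNELS)
total_interactions = sum(volume for volume, _ in SUPPORT_CHANNELS.values())

for name, cost in costs.items():
    volume = SUPPORT_CHANNELS[name][0]
    share = volume / total_interactions
    print(f"{name}: {share:.1%} of interactions, ${cost:,.0f} total")
```

With the slide's numbers, assisted support handles the fewest interactions yet carries the largest total cost, which is the contrast the slide draws between customer interactions and corporate investment.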
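
Slide 26's advice to match the production process to the content can be read as a simple lookup. The sketch below copies the table rows from that slide; the dictionary keys, the `Route` type and `choose_process` are hypothetical names introduced here, not part of any Asia Online product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Target quality and production process for one content type."""
    target_quality: str
    process: str


# Rows copied from the slide 26 table; the keys are hypothetical labels.
ROUTES = {
    "legal_marketing_mandatory": Route("High", "Human Translation, TEP"),
    "reference_kb": Route("Moderate", "Custom MT + Professional Post-Editing"),
    "user_generated": Route("Moderate to Low", "Custom MT + Community Post-Editing"),
    "random_corporate": Route("Low - Gisting", "Custom Corporate MT"),
    "random_web": Route("Low - Gisting", "Free MT"),
}


def choose_process(content_type: str) -> Route:
    """Look up the route for a content type; unknown types raise KeyError."""
    return ROUTES[content_type]


print(choose_process("user_generated").process)
```

A real deployment would likely route on measured volume and value rather than a fixed label, but the table form keeps the slide's point visible: the process is chosen per content type, not once for everything.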