GALA Webinar, September 2013

The Pangea Machine Translation platform from Pangeanic. A product presentation by Manuel Herranz, Elia Yuste and Andi Frank, showcasing the best of automated cleaning cycles, automated engine retraining and machine translation engine creation.

Published in: Technology, Business
  • Speaker note: technology tools developed by the industry, for the industry. A very applied, practical philosophy.

    1. PangeaMT: User-Empowering, Data-Driven, In-Domain Machine Translation
       Manuel Herranz, Elia Yuste, Alex Helle, Andi Frank
       #pangeanic · E: central@pangea.com.mt · pangeanic
    2. AGENDA
       • Industry reflections
       • Pangeanic → PangeaMT
       • Customization as Key Initial Servicing Step of our MT Offering
       • All about the PangeaMT Platform
         – Featuring Highlights and Demo
         – API: CAT Environment Integration (Demo)
       • Q&A Round
       • GALA Marketplace Offer
    3. COST OF TRANSLATION (price/word) vs DEMAND, 10-YEAR STEPS [chart: 1995, 2005, 2015]
       • Price per word a valid model?
       • Is there an explanation?
       • What can we do about it?
       • Unique to this industry?
       Is there a future for the Language Industry?
    4. MASSIVE AMOUNTS OF DATA – IS THE LANGUAGE BUSINESS MANAGEABLE? [chart: world's data in TB / EB vs typical translation volume, 1990 to 2015]
    5. Why Machine Translation?
       • As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes
       • Estimates:
         – Up 50% a year (Oracle)
         – Doubles every 11 hours (IBM)
       • Humankind has stored more than 295 billion gigabytes (or 295 exabytes) of data since 1986 (ComputerWorld, 2011)
       • Researchers at the University of California, Berkeley, found that the amount of data generated from the dawn of time through 2002 was about 5 exabytes.
    6. Why Machine Translation? The Data Deluge
       As Content Volume Explodes, Machine Translation Becomes an Inevitable Part of Global Content Strategy: http://ow.ly/jVuhZ
       • In 2011, it took about two days for the world to create the same 5 exabytes of data that it had taken humankind eons to generate.
       • In 2013, it took the world just 10 minutes to create 5 exabytes.
       • Eric Schmidt: “Every 2 Days We Create As Much Information As We Did Up To 2003” (TechCrunch, 2010)
       • 1 EB = 1,000^6 bytes = 10^18 bytes = 1,000 petabytes = 1 billion gigabytes
    7. Where is data stored?
    8. What can I do with MT? Machine Translation application, NEW usage and success depend on:
       • MT for assimilation: “gisting” or “understanding”
         – Practically unlimited demand, but free web-based services reduce the incentive to improve the technology
         – Coverage more important. Instant quality
       • MT for dissemination: “publication”
         – Publishable quality that can only be achieved by humans. MT & tools are a productivity booster
       • MT for direct communication
         – Current R&D; military uses of systems for spoken MT; first applications for smartphones, online help, multilingual chat systems
       [diagram labels: Sports, Politics, Social, etc.; Output format]
    9. Short history
       • Pangeanic: LSP. Major clients in Asia, European localization, increasing number of languages
       • Need to produce translation faster, cheaper…
       • Experimenting with some rule-based (RB) MT systems
       • TAUS & TDA founding members
       • Partnering with Valencia's Computer Science Institute and the research team of Prof. F. Casacuberta / E. Vidal
       • Commercial implementations of PangeaMT systems at client side: SONY EUROPE, SYBASE, LSPs…
    10. Milestones
        • EU post-editing contract 2007 (RBMT output)
        • Euromatrix mention
        • AMTA 2010
        • AAMT 2011/12 (JP Hybridization and MT DIY)
        • 1st commercial platform 2010
        • DIY 2011 (automated re-training cycles)
        • SaaS Power, LocWorld Paris 2012
        • Improved automated cleaning cycles
        • Online automated training
        • Regional EU R&D Funds (“Feder” x 3: 2009-2011) & Marie Curie EXPERT Project
    11. Customization by the PangeaMT Team: key to achieving better qualitative results later
        • Top-notch human and automated service
        • Focused on the Client from day one!
        • Prior to 1st-time Engine Delivery → prior to Platform Deployment (production)
        • Customization concentrates on data and best engine consultancy
        • Data cleaning and enhancement
        • The impact of glossaries (in-domain, client-/product-specific…)
        • Reporting (your data was like this… now let’s do this)
        • Training
        • Pangeanic tests all the development features in-house at a TRANSLATION DEPARTMENT BEFORE RELEASE.
    12. Getting the data right: Automated cleaning and preparation
    13. Don’t forget data cleaning!!! [diagram labels: Cleaning, More cleaning]
        <tu srclang="en-GB">
        <tuv xml:lang="EN-GB">
        <seg>A system for recovering the methane that is emitted from the manure so that it does not leak into the atmosphere.</seg>
        </tuv>
        <tuv xml:lang="FR-FR">
        <seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel d'origine animale de sorte qu'il ne se dissipe pas dans l'atmosphère.</seg>
        </tuv>
        <tu creationdate="20090817T114430Z" creationid="APIACCESS" changedate="20110617T141159Z" changeid=“pat">
        <tuv xml:lang="EN-US">
        <seg>Overall heigtht –<bpt i="1">{f43 </bpt> <ept i="1">}</ept>25&quot;; width – <bpt i="2">{f43 </bpt> <ept i="2">}</ept>20.1&quot;.</seg>
        </tuv>
        <tuv xml:lang="ES-EM">
        <seg><bpt i="1">{f2 </bpt>Altura total - 25&quot;; anchura <ept i="1">}</ept>–<bpt i="2">{f43 </bpt> <ept i="2">}</ept><bpt i="3">{f2 </bpt>20,1&quot;.<ept i="3">}</ept></seg>
        </tuv>
        </tu>
        <tuv xml:lang=“EN-US">
        <seg>On 22nd May we decided not to join the group.</seg>
        <tuv xml:lang=“DE-DE">
        <seg>Am 22. </seg>
    14. Don’t forget data cleaning!!! [diagram labels: Cleaning, More cleaning]
        <tu srclang="en-GB">
        <tuv xml:lang="EN-GB">
        <seg>The President of the United States visited Costa Rica.</seg>
        </tuv>
        <tuv xml:lang=“ES-ES">
        <seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora Michelle, visitaron Costa Rica el pasado sábado.</seg>
        </tuv>
        <tuv xml:lang=“JP">
        <seg> 同書は「通訳・翻訳キャリアガイド」の 2011-2012 年度版。 英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。 </seg>
        <tuv xml:lang=“EN-US">
        <seg>It is a journalistic point of view and strengths of the English-language newspaper Japan Times. It includes a description of the exciting and rewarding work of translation and interpretation, as well as the introduction of consciousness and how to acquire the required professional skills. The road to becoming a translator and interpreter also down to the actual work site, a comprehensive guide to interpreting the reality of today's translation industry. </seg>
    15. DATA CLEANING CYCLE (AUTOMATED)
        • Parallel text extraction / translation input / post-edited material: this often comes from CAT tools, document alignments or crawling.
        • Data cleaning (in-lines): remove all non-translation data. [TMX]
        • Data cleaning modules remove any “suspects”:
          – Sentences that are too long
          – Mismatches (of many kinds!)
          – Terminological inaccuracies
          – Non-useful segments, etc.
        • Human approval: some of this material may actually be OK for training. It is then input into the training set.
        • Engine training with clean data: having approved, terminologically sound, clean data improves engine accuracy and performance, even with small sets of data.
    16. A Success Story: successfully applied (3rd-party applications / beneficiaries), http://ow.ly/peuFD
        • Sony Professional Europe, Salomé Lopez-Lavado
          – Needs: improve publication in French, Italian, Spanish
          – 8M words training set
          – Time to market: from 3 days down to 1.5 days (html, InDesign)
          – Outsourcing cost: -20%
          – Volume: 1.5M words/year
        • Japanese automotive manufacturer
          – Spanish
          – 8M words/year
          – Time to market reduced by 2 to 3 weeks, from 8 weeks to 6 or 5 weeks
          – Team of 17 freelancers down to 4-7 post-editors
          – Outsourcing cost: -30%
        • Spanish LSP working for the banking sector
          – Spanish
          – 1-2M words/year
          – Time to market: from 1 week to 2 days!!!!
          – Docx, html, tmx
          – Down from 2-3 in-house staff and 2-3 freelancers to 2 in-house!!!
    17. Use Case ✔ Even with small data sets!!
    18. Potential Uses of Machine Translation
        • PangeaMT can be self-hosted when data security is critical (all processes internal to the organization):
          – commercially sensitive data
          – financial, legal, institutional
          – intelligence, knowledge-gathering
          – product pre-release, etc.
        • Control Panel + full system statistics
        • Re-trainings and updates by the client for data privacy / more accuracy
    19. Other Potential Uses of Machine Translation
        • Information discovery: patents, unknown documents
        • Automatic, on-demand creation of foreign language versions / web apps – keyword testing
        • Multilingual crawling, data discovery
        • Pre-translation
    20. Polling Questions to Audience
    21. Platform overview
        • 24/7 control over your data and engines
        • Secure, robust and scalable
        • User focused (permissions and empowering capabilities)
        • API linked, if need be
        • Enabled us to offer an extraordinarily flexible business model:
          – SaaS
          – SaaS Power (online DIY, re-trainings included)
          – Full Power (PLATFORM OWNERSHIP)
    22. PangeaMT System – Domain Creation
    23. PangeaMT System – Data Cleaning
    24. PangeaMT System – Engine Creation
    25. PangeaMT System – Engine Training
    26. PangeaMT API – SDL Plugin Demo Time (Video file)
    27. Myth: MT will never be as good as humans
        “We cannot solve the problem using the same tools and the way of thinking that created it” (A. Einstein)
        • 1st stage: we are creating usable engines, first PE experiences. 2009-2015 or 2020
        • 2nd stage: PE material and more data make engines even more predictable. More specialist engines
        • 3rd stage: beyond 2030... no predictions
        Uhmmm, it is going to get really good...
    28. GALA Marketplace Offer: Free Consultancy and Custom Engine Piloting Period, October-November 2013. central@pangea.com.mt
    29. Q&A. Thank you!! central@pangea.com.mt
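
The automated cleaning cycle described on slides 13 to 15 can be illustrated with a minimal Python sketch. It is not PangeaMT's implementation, which the slides do not show. The function names (`seg_text`, `is_suspect`, `clean`) and the thresholds (`max_chars`, `max_ratio`) are hypothetical, chosen only for illustration. The sketch strips TMX inline formatting codes from a segment, then routes over-long or badly length-mismatched pairs to a human review queue instead of the training set.

```python
import re
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not translatable text.
CODE_TAGS = {"bpt", "ept", "it", "ph", "ut"}


def seg_text(seg):
    """Return the plain text of a TMX <seg> element without inline codes."""
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in CODE_TAGS:
            # e.g. <hi> wraps translatable text, so keep what is inside it
            parts.append(seg_text(child))
        parts.append(child.tail or "")
    return re.sub(r"\s+", " ", "".join(parts)).strip()


def is_suspect(src, tgt, max_chars=400, max_ratio=2.0):
    """Flag a segment pair for human review (thresholds are illustrative)."""
    if not src or not tgt:
        return True  # one side is empty, nothing to learn from
    if len(src) > max_chars or len(tgt) > max_chars:
        return True  # sentences that are too long
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio > max_ratio  # length mismatch, e.g. a truncated target


def clean(pairs, **limits):
    """Split (source, target) pairs into training data and a review queue."""
    training, review = [], []
    for src, tgt in pairs:
        bucket = review if is_suspect(src, tgt, **limits) else training
        bucket.append((src, tgt))
    return training, review


if __name__ == "__main__":
    seg = ET.fromstring(
        '<seg>Press <bpt i="1">{b}</bpt>Start<ept i="1">{/b}</ept> now.</seg>'
    )
    print(seg_text(seg))
    training, review = clean([
        ("Hello world.", "Hola mundo."),
        ("On 22nd May we decided not to join the group.", "Am 22."),
    ])
    print(len(training), "pairs kept,", len(review), "sent to human review")
```

A character-length ratio is only a rough proxy for a mismatch: sensible limits differ by language pair (Japanese against English, for instance), so a real pipeline would likely tune them per engine. Pairs in the review queue mirror the "human approval" step on slide 15, where material found to be acceptable is returned to the training set.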
