Bridging the Gap between Iberian Languages - MT to the rescue


A presentation given by Juan Alberto Alonso on the co-official languages of Spain and the special role Machine Translation (MT) plays among them.

Published in: Business, Technology
Bridging the Gap between Iberian Languages - MT to the rescue

  1. Bridging the Gap between Iberian Languages - MT to the rescue. Juan Alberto Alonso, 04.05.2012
  2. Agenda
     • Basque and Portuguese: two special cases
     • A success case with Catalan
     • MT and the Iberian languages
     • The Use of MT
     © Lucy Software Ibérica SL
  3. The Use of MT: When is MT useful?
  4. When is MT Useful?
     • When it is adapted to the user’s specific needs:
       • Terminology
       • Document format
       • Linguistic peculiarities
  5. When is MT Useful?
     • When it is properly used according to:
       • The translation quality delivered by the language pair in question
       • The type of documents to be translated
       • The user environment into which it has to be integrated
  6. When is MT Useful?
     • When it is well integrated into the user’s document flow:
       • CMS, proxies, etc.
       • Press agencies and newspapers
       • Translation agencies
  7. The Uses of MT
     • Dissemination (Production)
       • Very high MT quality
       • For closely related languages
       • Can be integrated into very complex user environments
     • Assimilation (Information)
       • Translation quality does not need to be very high
       • For linguistically more distant languages
       • Useful to break language barriers (from 0% to X%)
  8. Languages: Multilingualism, projection and role in the world
  9. Official policies: Language policy
     • Spanish is the official language in Spain, alongside four co-official languages:
       • Basque
       • Catalan/Valencian
       • Galician
     • Portuguese: a new linguistic standard aimed at the international unification of the language.
  10. MT with the Iberian Languages: A Unique Case
     • Dissemination (Production)
       • Very high MT quality
       • For closely related languages
       • Can be integrated into very complex user environments
     • Assimilation (Information)
       • Translation quality does not need to be very high
       • For linguistically more distant languages
       • Useful to break language barriers (from 0% to X%)
  11. MT with the Iberian Languages: A Unique Case
     • Dissemination
       • Very high MT quality
       • For closely related languages
       • Can be integrated into very complex user environments
     • Assimilation
       • Translation quality does not need to be very high
       • For linguistically more distant languages
       • Useful to break language barriers (from 0% to X%)
     • Political Factors
       • The promotion of minority languages is a political issue and is supported by local Governments
       • Need for huge translation volumes
  12. Castilian, Catalan and Galician: An Ideal Scenario for MT
     • The translation quality yielded by MT among Castilian, Catalan and Galician is very high (above 95%).
     • Through a ramp-up phase in which the MT system is adapted to the user’s needs, this quality can become even better.
     • The daily “normal” use of Catalan and Galician is officially encouraged and supported by the corresponding local Governments.
  13. Castilian, Catalan and Galician: An Ideal Scenario for MT
     • There is a real and constant need to translate huge volumes of documentation between Castilian and Catalan (less so for Galician).
     • MT has been used for years in complex production environments for Castilian-Catalan (newspapers, translation agencies, Public Administrations, etc.), with millions of words MT-translated and post-edited on a daily basis, and therefore...
     • There is a years-long culture of productive MT use, with users and post-editors trained to use these systems. This is probably a unique case in the world.
  14. Castilian, Catalan and Galician: An Ideal Scenario for MT
     • Dissemination
       • Very high MT quality
       • For closely related languages
       • Can be integrated into very complex user environments
     • Political Factors
       • The promotion of minority languages is a political issue and is supported by local Governments
       • Need for huge translation volumes
  15. A Success Case for Spanish-Catalan: La Vanguardia
     • La Vanguardia is the leading newspaper in Catalonia, and one of the main newspapers in the rest of Spain, with an average daily circulation of over 200,000 copies. It is widely recognized as a quality newspaper.
     • Since May 3rd, 2011, La Vanguardia has had two parallel editions, one in Spanish and another in Catalan.
  16. La Vanguardia: The Challenge. 3 Options
     • Given the task of producing bilingual daily editions of a newspaper, three possible options could be considered:
       • The “MT-less” option: Using no MT at all
       • The “full-MT” option: “Only” using MT
       • The “sensible-MT” option: Using MT + customization + human post-editors
  17. La Vanguardia: The Challenge. The “MT-less” Option
     • Duplicate the whole human editorial team and/or hire a team of N human translators to translate the entire newspaper content on time, in order to keep both editions synchronized for publishing.
     • Duplicate most of the IT infrastructure.
     • Given all these factors, the question arises of whether it would be feasible to produce bilingual editions of a newspaper this way, because of:
       • A dramatic increase in costs
       • Very tight time constraints
  18. La Vanguardia: The Challenge. The “full-MT” Option
     • Run all the contents of the base edition through an MT system.
     • Publish the raw MT translation of the original contents in the other-language edition.
     • Obviously, this is not an option because, even for language pairs for which the quality of MT is very high (as is the case for Spanish-Catalan, > 95%), the output mistakes would be unacceptable for publishing (proper nouns being translated, homographs, etc.) and the resulting Catalan style would not always sound “natural” to Catalan speakers.
  19. La Vanguardia: The Challenge. The “sensible-MT” Option
     • Customize the MT system to the specific linguistic needs of the newspaper (style guide, corporate terminology, proper nouns, etc.).
     • Integrate the MT flow into the newspaper’s editorial flow (document and character formats, connection to a post-editing environment, feedback processing, etc.).
     • Incorporate into the editorial flow a post-editing environment to be used by a team of human post-editors.
     • Here we have a compromise between MT use (saving time and effort) and translation quality.
  20. Requirements from La Vanguardia
     • One daily copy of La Vanguardia includes over 60,000 words, all of them to be translated, revised and post-edited.
     • The Catalan edition should comply with the linguistic requirements stated in the Style Guide of La Vanguardia.
     • Both editions should be ready for printing every day by 23:30 at the latest.
     • Currently, most journalists at La Vanguardia write in Spanish, which is now the base edition from which the Catalan edition is created, but
     • In the short to medium term, every journalist will be free to write in the language of his/her choice (Catalan or Spanish), so that there will actually be no base edition.
     • Both the MT system and the post-editing environment should be completely integrated into their editorial flow (both IT integration and human team integration).
  21. How the MT-System was Customized for La Vanguardia
     • Computational linguists, post-editing experts and the La Vanguardia editorial team worked together for six months in order to:
       • Customize the MT system to their linguistic requirements (as far as possible)
         • Over 20,000 lexical entries added/changed in the MT-system lexicons
         • Around 440 rules adapted in the MT-system grammars
       • Integrate the MT system into their IT editorial environment
         • Integration with the HERMES CMS
         • La Vanguardia-specific character format and XML tag handling
         • Inclusion of markups specifically designed for post-editors
         • Translation performance to meet the translation load and peak requirements
     • A team of around 15 people has been trained in post-editing the MT output before publishing.
  22. 22. La VanguardiaConclusions  Producing two parallel bilingual editions of a daily newspaper only seems to be feasible if:  MT is used  MT is properly customized, adapted and integrated to the newspaper linguistic and IT requirements.  There is a team of trained specialized human post-editors who correct MT mistakes and “give the human flavor” to the output.© Lucy Software Ibérica SL / 22
  23. Portuguese: A Different Scenario
     • Portuguese is one of the Iberian languages with high business potential (both in Portugal and in Brazil/South America).
     • The translation quality given by MT systems between Portuguese and Spanish is very high (similar to that among Castilian, Catalan and Galician).
     • However, in the case of Portuguese, the key factor is business needs and opportunities, not the political drive.
  24. Portuguese: A Different Scenario
     • Dissemination
       • Very high MT quality
       • For closely related languages
       • Can be integrated into very complex user environments
     • Market Needs
       • There is a wide market asking for quality translation between Portuguese and Spanish
       • Need for huge translation volumes
  25. Basque: Yet Another Different Scenario
     • EU: Iberiar Penintsulako hizkuntzen artean euskara kasu berezia da
     • ES: El vasco es un caso particular entre las lenguas de la Península Ibérica
     • CA: El basc és un cas particular entre les llengües de la Península Ibèrica
     • GL: O vasco é un caso particular entre as linguas da Península Ibérica
     • PT: O basco é um caso particular entre as línguas da Península Ibérica
     • EN: Basque is a special case among the languages of the Iberian Peninsula
  26. Basque: Yet Another Different Scenario
     • Dissemination
       • Enough MT quality for restricted domains
       • For closely related languages
       • Can be integrated into very complex user environments
     • Assimilation
       • Translation quality does not need to be very high
       • For linguistically more distant languages
       • Useful to break language barriers (from 0% to X%)
     • Political Factors
       • The promotion of minority languages is a political issue and is supported by local Governments
       • Need for huge translation volumes
  27. Basque: Yet Another Different Scenario
     • Basque is a special case among the Iberian languages:
       • It is not an Indo-European language. It is linguistically very different from the rest of the Iberian languages (and, incidentally, also from any other human language).
       • The MT quality between Basque and Castilian, Portuguese, Galician or Catalan will be lower than that obtained among the latter four.
       • Adapted for restricted domains, the MT quality can be sufficient for productive use.
       • Its daily “normal” use is being encouraged and supported by the Basque Government.
       • The use of MT to translate from Basque into Castilian, Catalan, Portuguese, Galician or English is a good example of assimilation use (breaking language barriers).
  28. Lucy Basque MT Portal
     • The first MT systems with Basque already exist, and new ones will be developed in the short to medium term.
  29. Questions?
  30. Thank you for your attention! Juan A. Alonso, Lucy Software Ibérica
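
Slides 18 and 19 name "proper nouns being translated" as a typical raw-MT mistake and list proper nouns among the things the system was customized for. The slides do not say how this was done. One common way to get a similar effect, shown here purely as an illustrative sketch, is to mask protected terms with placeholders before translation and restore them afterwards. All names below (protect_terms, restore_terms, translate_with_protection, toy_translate) are hypothetical, and toy_translate is only a stand-in for a real MT engine, not the Lucy system.

```python
import re


def protect_terms(text, protected_terms):
    """Replace each protected term with an opaque placeholder token."""
    mapping = {}
    # Longest terms first, so a multi-word name wins over a shorter overlap.
    ordered = sorted(protected_terms, key=len, reverse=True)
    for i, term in enumerate(ordered):
        token = f"__TERM{i}__"
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(text):
            text = pattern.sub(token, text)
            mapping[token] = term
    return text, mapping


def restore_terms(text, mapping):
    """Put the original terms back in place of their placeholders."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


def translate_with_protection(text, protected_terms, translate):
    """Mask protected terms, run the given translate function, then unmask."""
    masked, mapping = protect_terms(text, protected_terms)
    return restore_terms(translate(masked), mapping)


def toy_translate(text):
    # Hypothetical stand-in for an MT engine: naive Spanish-to-Catalan
    # substitution that also (wrongly) rewrites the surname "Flores".
    return text.replace("Flores", "Flors").replace("flores", "flors")


sentence = "Rosa Flores compra flores"
raw_output = toy_translate(sentence)
protected_output = translate_with_protection(
    sentence, ["Rosa Flores"], toy_translate
)
print(raw_output)
print(protected_output)
```

In this toy run the unprotected output turns the surname into "Flors", while the protected call leaves "Rosa Flores" intact and still translates the common noun. A production system would more likely handle this inside its lexicons, as slide 21 suggests with its 20,000 lexical entries.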