What is machine translation

AlphaCRC on MT

Published in: Software

  1. What is Machine Translation? GT 09/08/2013
  2. Topics covered
     • Three types of Machine Translation
     • What can be translated?
     • Common MT systems
     • Which systems do our clients use?
     • Which system do we use?
  3. Three Types of Machine Translation
     • Statistical Machine Translation (SMT)
     • Rule-Based Machine Translation (RBMT)
     • Hybrid Machine Translation
       – Rules post-processed by statistics
       – Statistics guided by rules
  4. Statistical Machine Translation (SMT)
     • Developed by IBM in the early 1990s.
     • It is called “Statistical” because it is based on probability.
     • Two- or three-step process:
       1. Training
       2. Decoding (= machine translation)
       3. [Recommended] Re-training (= improving the engine once the files have been post-edited)
     • Training is the critical step of Machine Translation and takes much longer than the machine translation process itself.
  5. SMT – Training Process
     1. Start by creating a Training Corpus
        – Can be one or several translation memories in TMX format
        – Can be a collection of source and target texts that will need to be aligned
     2. Clean the corpus (automatic or semi-automatic process)
        – Remove duplicates (keeping the most recent entry), identical source-target segments, and tags => Result is clean, text-only sentences
        – Can involve manual cleansing depending on the level of “noise” found
     3. Build a language model from the corpus (automatic process)
        – Built for the target language only
        – Contains n-grams (groups of n words)
        – Used to find the smoothest translation = High probability of using the correct n-gram based on its frequency in the corpus => Fluency.
     4. Build a translation model from the corpus (automatic process)
        – Bilingual model
        – Contains n-grams
        – Used to find the best translation match = High probability that a target n-gram is the translation of a source n-gram => Accuracy.
  6. SMT – Decoding Process
     • What people understand Machine Translation to be
     • A file is processed sentence by sentence
     • Each sentence is broken into n-grams
     • The n-grams are translated based on the highest probability scores in the translation (phrase) model and in the language model
     • Each sentence is reconstructed from the best-scoring n-grams
     • The file is reconstructed from all the translated sentences
  7. SMT Example (ES-EN)
     Source:       Maria | no  | daba | una | bofetada | a  | la  | bruja | verde
     Word-by-word: Mary  | not | give | a   | slap     | to | the | witch | green
     Other phrase options: did not | a slap | by | green witch | no | slap | to the | did not give | to | the | slap | the witch
     • The translation model tells us which translation is more likely given the source words.
     • The language model tells us which translation reads best in the target language.
     Possible good translations:
     • Mary did not give a slap to the green witch.
     • Mary did not slap the green witch.
  8. SMT – Re-training Process
     • This is an optional but recommended step.
     • The post-edited files are converted into a new TMX file.
     • The post-editors’ feedback is used to attempt to correct frequently occurring errors => Modify engine settings.
     • The engine is re-trained using the previous Training Corpus as well as the new TMX file.
  9. SMT – Considerations
     • A large training corpus does not guarantee good quality MT output.
     • A clean and consistent training corpus must be used in order to achieve good quality MT output.
     • It is best to use a domain-based engine even when the client is the same, e.g. create one engine for UI and one for Help/Doc.
     • The quality of the MT output can vary from language to language and even from handoff to handoff.
     • The quality of the source text is important: consistent terminology and sentence structure produce better output.
     • SMT engines can be tuned and improved with feedback.
     • SMT engines can be re-trained and improved by updating the training corpus with newly post-edited content.
  10. Rule-Based Machine Translation (RBMT)
      • Based on:
        – Terminology
          • Bilingual or multilingual dictionary needed
          • Monolingual normalisation dictionary needed in order to standardise or correct source text before translation or to correct target text after translation
        – Rules representing the source sentence structure
        – Rules representing the target sentence structure
        – Rules on how the source structure and the target structure relate to each other
      • Steps:
        1. Obtain part-of-speech information for each source word (article, noun, verb etc).
        2. Obtain syntactic information about the verb (tense, person, voice).
        3. Parse the source sentence in order to identify the structure (subject, verb, object etc).
        4. Translate source words into target words.
        5. Create the translated sentence by mapping dictionary entries into appropriate inflected forms based on target rules.
        6. [Optional but recommended] Once the post-editing is complete, update the dictionaries and/or rules based on the post-editors’ feedback.
  11. RBMT – Considerations
      • Need very good dictionaries => Building new dictionaries is expensive because it needs to be done by a skilled linguist for each language.
      • The output may be accurate and grammatically correct, but not always very fluent.
      • RBMT engines are more expensive than SMT engines because a great deal of effort is required in terms of development and customisation before the engine produces the desired quality.
      • SMT engines can be re-trained automatically, whereas RBMT engines can only be updated through human intervention (update dictionaries and rules).
  12. Hybrid Machine Translation
      Two types:
      • Rules post-processed by statistics
        – Translations are performed using a rules-based engine.
        – Statistics are then used in an attempt to adjust/correct the output from the rules engine.
      • Statistics guided by rules
        – Rules are used to pre-process data in an attempt to better guide the statistical engine.
        – Rules are also used to post-process the statistical output to perform functions such as normalization.
        – This approach has a lot more power, flexibility and control when translating.
  13. What can be machine-translated?
  14. Three File Types
      • Monolingual files (e.g. DOCX, HTML, TXT)
        – Engines can translate monolingual files, but the result is a monolingual translation => Very difficult to post-edit without reference to the source.
      • Translation memories in TMX format
        – The MT output is inserted into the target area of the translation unit.
        – The source files for translation are processed in a CAT tool against the MT TM, but penalties are applied to translation hits originating from the MT TM to indicate that the translation needs to be post-edited.
      • Bilingual files
        – The best option is to machine-translate XLIFF files. These are bilingual files that can be imported into all modern CAT tools => Post-editing can be supported by the use of a standard TM.
        – Machine-translated segments are flagged with a specific status in the CAT tool.
  15. Which content?
      • Technical, structured content fares better than creative, free-flowing content
        – MT is well suited to help systems, user guides, FAQs, Knowledge Base articles
      • UI strings are not necessarily well suited to MT
        – UI strings can be difficult to interpret in standard localisation projects (omitted words for conciseness, variables, verb or noun?) => If UI strings are difficult for a human to interpret, it will be even harder for the engine
        – Short strings are not necessarily easier for the MT engine to decode than longer strings
      • Do not expect the engine to be creative
        – If words are not present in the Training Corpus or in the Dictionaries, the engine will not be able to come up with a translation for them => Depending on the engine, unknown words will be omitted or left untranslated in the MT output
      • What level of MT output do you require?
        – Do you need to bring the MT output to human-quality level?
        – Do you simply need to be able to understand what is being said (e.g. social network sites, support chat lines)?
  16. Common MT Applications (1/3): SMT
      • Google Translate
        – 71 languages
        – Often translates into an intermediate language and into English first to arrive at the real target language, e.g. Catalan (ca ↔ es ↔ en ↔ other)
        – CAREFUL ABOUT NDA!
      • Microsoft Translator
        – 39 languages
        – Bing Translator online
        – Free API up to 2 million characters per month
        – Offer Enterprise solutions
        – CAREFUL ABOUT NDA!
      • SDL Language Weaver (SDL BeGlobal)
        – 54 language pairs
        – Free of charge to individual translators through Trados Studio 2011, but the engine is not specific to their client or to their domain => CAREFUL ABOUT NDA!
        – Subscription for Enterprises and LSPs
        – Enterprises and LSPs may train their own engines via SDL BeGlobal Trainer (secure)
        – Make MT part of the translation workflow via WorldServer
        – Make MT suggestions available through the cloud via Trados Studio 2011
  17. Common MT Applications (2/3): SMT
      • Language Studio (by Asia Online)
        – Over 500 direct language pairs
        – Offer on-site server installation => Licenses based on language pairs and translation volume capacity.
        – Offer Software as a Service (SaaS) => Pay as you go with 3 options (volume, fixed monthly fee, file size).
        – Offer 4 levels of MT quality, all with varying degrees of customisation (and price)
        – Customisation is carried out by Asia Online
      • Moses
        – Open Source (free)
        – No language limitations
        – Highly customisable on all levels (training and decoding) => Companies use Moses but tailor it to their needs
        – Possible to turn it into a Hybrid system with the application of language-specific rules
  18. Common MT Applications (3/3)
      RBMT
      • PROMT
        – 12 language pairs (no Asian support)
        – Provide a free online translator tool (Online-translator.com)
        – PROMT Professional (for translators) costs $265
        – Offer Enterprise solution (part of translation workflow)
      • Apertium
        – Open Source (free)
        – 36 language pairs
        – No Asian character support
      Hybrid
      • SYSTRAN
        – Has been around for 40+ years
        – Started out as an RBMT system and has now been updated with the use of statistics
        – 52 language pairs
        – SYSTRAN Premium Translator version lets you fully manage the dictionary (~ £700)
        – SYSTRAN Enterprise Server 7 available in three editions depending on company needs
        – SYSTRAN says it is the fastest MT solution available
  19. Timeline
      2010
      - Asia Online launches Language Studio, a comprehensive MT and post-editing solution.
      - Systran launches its enhanced Enterprise 7 MT software.
      - Language Weaver launches its ‘quality confidence’ module. The company is acquired by SDL.
      2009
      - Systran releases version 7, a hybrid version of its original RBMT. Includes an automated post-editing module.
      2007
      - MOSES is launched as a downloadable kit. It begins to be used in a large-scale EU project (Euromatrix) to speed up the MT development of new language pairs.
      2004
      - The OpenTrad project funded by the Spanish government begins to develop MT engines for Spain’s various languages. Using an existing RBMT engine, the consortium builds Apertium.
      2002
      - Language Weaver is founded in California to develop SMT systems.
      2001
      - IBM launches its WebSphere translation engine for 8 languages.
      - The National Institute of Standards and Technology (NIST) launches its first round of MT system benchmarking.
      1997
      - The AltaVista Babelfish service is launched on the web using Systran.
  20. Which MT systems do our clients use?
      SMT
      • Adobe: Moses – Carried out initial tests in 2009 using PROMT for Russian and Language Weaver for French and Spanish
      • Autodesk: Moses
      • HP: Language Weaver – Also have access to Microsoft Translator
      • Oracle: Moses – Switched from Language Weaver in 2012
      • Sybase: Moses – Trained by Pangeanic in Spain
      RBMT
      • PTC: PROMT
      Hybrid
      • Symantec: SYSTRAN
  21. Which system do we use?
      Moses hybrid (Statistics guided by rules)
      To be continued…
  22. Thank you!
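The corpus-cleaning and language-model steps from slide 5 can be illustrated with a small sketch. It is not how Moses or any engine named in the slides is implemented: the function names (`clean_corpus`, `train_bigram_counts`, `fluency`), the tag pattern, and the add-one smoothing are all assumptions chosen for brevity. The sentences reuse the Spanish-English example from slide 7.

```python
import re
from collections import Counter

# Assumption: inline markup looks like <tag>...</tag>; real TMX cleaning is richer.
TAG_RE = re.compile(r"<[^>]+>")


def clean_corpus(pairs):
    """Clean (source, target) pairs that are ordered oldest to newest.

    Strips tags, drops empty and identical source-target segments, and keeps
    only the most recent target for each duplicated source.
    """
    latest = {}
    for source, target in pairs:
        source = TAG_RE.sub("", source).strip()
        target = TAG_RE.sub("", target).strip()
        if not source or not target or source == target:
            continue
        latest.pop(source, None)  # re-insert so the newest entry wins
        latest[source] = target
    return list(latest.items())


def train_bigram_counts(sentences):
    """Count unigram histories and bigrams over target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def fluency(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram probability; higher means more fluent."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Hypothetical toy data for illustration only.
raw_pairs = [
    ("la bruja verde", "the witch green"),                 # older entry
    ("<b>la bruja verde</b>", "the <b>green witch</b>"),   # newer entry with tags
    ("OK", "OK"),                                          # identical source and target
]
print(clean_corpus(raw_pairs))

target_sentences = [
    "Mary did not slap the green witch",
    "the green witch did not give a slap",
]
unigrams, bigrams = train_bigram_counts(target_sentences)
vocab_size = len(unigrams) + 1  # +1 for the end-of-sentence token
for candidate in ("Mary did not slap the green witch",
                  "Mary not slap the witch green"):
    print(candidate, fluency(candidate, unigrams, bigrams, vocab_size))
```

With this toy data the word-by-word candidate scores lower than the reordered one, which is the "fluency" role the slides assign to the language model. A production system would use much larger corpora, higher-order n-grams, and log probabilities.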

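The RBMT steps from slide 10 (part-of-speech lookup, dictionary translation, structural transfer) can be sketched in the same spirit. This is a toy under stated assumptions, not PROMT's, Apertium's, or SYSTRAN's design: the `LEXICON` entries, the tag names, and the single noun-adjective reordering rule are hypothetical, and verb analysis and inflection (steps 2 and 5) are left out.

```python
# Hypothetical bilingual dictionary: Spanish word -> (English word, part of speech).
LEXICON = {
    "la": ("the", "DET"),
    "bruja": ("witch", "NOUN"),
    "verde": ("green", "ADJ"),
    "casa": ("house", "NOUN"),
    "blanca": ("white", "ADJ"),
}


def tag(sentence):
    """Look up each source word; unknown words pass through tagged UNK."""
    tagged = []
    for word in sentence.lower().split():
        target, pos = LEXICON.get(word, (word, "UNK"))
        tagged.append((word, target, pos))
    return tagged


def transfer(tagged):
    """Apply one structural rule: Spanish NOUN ADJ becomes English ADJ NOUN."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][2] == "NOUN" and out[i + 1][2] == "ADJ":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just reordered
        else:
            i += 1
    return out


def translate(sentence):
    """Tag, reorder, and emit the target words."""
    return " ".join(target for _, target, _ in transfer(tag(sentence)))


print(translate("la bruja verde"))
print(translate("la bruja azul"))  # "azul" is not in the toy dictionary
```

The second call shows the behaviour noted on slide 15: a word missing from the dictionary is left untranslated, and because its part of speech is unknown, the reordering rule does not fire either. Improving the output means a linguist adding entries or rules, which matches the cost point made on slide 11.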