DeepMiner
Integrating Translation Memories and Machine Translation
tekom, October 25th, 2012
Presenter: Daniel Benito
Introduction
• History
• Limitations of Translation Memory
• Beyond Segment-Level Reuse
  – Machine Translation
  – Fuzzy Match Repair
  – Advanced Leveraging
  – Combining TM and MT
• Current Limitations
• Perspectives
• Conclusion
History
• Past:
  – 1950s: early Machine Translation (MT) experiments
  – 1960s: general awareness that MT was not going to replace human translators
  – 1970s: first proposals for translator workstations
  – 1990s: Translation Memory (TM) became viable
• Present:
  – TM technology has barely advanced in the last ten years
  – MT has advanced to the point where its applications in the translation industry are incontrovertible
Limitations of Translation Memory
• Segment-level translation reuse is only useful in limited cases
• Even in highly repetitive texts, most of the repetition happens at the sub-segment level:
  – Terms and phrases
  – Sentence structure
• Most Translation Memory systems are limited to providing fuzzy matches and are unable to exploit sub-segment repetition
Beyond Segment-Level Reuse
• We need to translate:
  EN: The black cat usually sleeps in the hallway.
• Our TM contains:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
• What can we do to reduce the time spent editing fuzzy matches?
  – Ignore the fuzzy matches and use MT
  – Automatically repair the fuzzy matches
Machine Translation
• We need to translate:
  EN: The black cat usually sleeps in the hallway.
• Results returned by various MT systems:
  DE: Die schwarze Katze in der Regel schläft im Flur.
  DE: Die schwarze Katze schläft normalerweise im Flur.
• Achieving consistency and using specific terminology (e.g. Gang instead of Flur) will require some degree of training or post-editing
Machine Translation
• General-purpose MT engines such as Google Translate or Microsoft Translator usually require extensive post-editing, but can be used for inspiration
• Rule-based and statistical MT engines customized for specific domains offer much higher quality, but require expensive tuning or retraining
• It is usually more expensive to use MT than to manually edit a fuzzy match
Fuzzy Match Repair
• Inspired by the translation-by-analogy concept from Example-Based Machine Translation (EBMT)
• Attempts to maintain the quality and consistency of existing translations in the TM while increasing productivity
Fuzzy Match Repair
• We need to translate:
  EN: The black cat usually sleeps in the hallway.
• Our TM contains:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
• We can replace graue with schwarze and Wohnzimmer with Gang to produce an exact match.
Fuzzy Match Repair
• Requires knowing the following translations:
  grey → graue
  black → schwarze
  living room → Wohnzimmer
  hallway → Gang
• What do we do if those translations are not explicitly in our TMs or termbases?
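The substitution step described above can be sketched in a few lines of Python. This is an illustrative toy, not DeepMiner's actual implementation: the word-level diff via `difflib` and the `repair_fuzzy_match`/`glossary` names are choices made for the example.

```python
# Illustrative sketch of fuzzy match repair (not DeepMiner's actual API):
# diff the new source against the TM source at word level, then swap the
# translations of the changed sub-segments in the TM target.
from difflib import SequenceMatcher

def repair_fuzzy_match(new_source, tm_source, tm_target, glossary):
    """Repair a fuzzy match using a glossary of sub-segment translations."""
    old_words, new_words = tm_source.split(), new_source.split()
    repaired = tm_target
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=old_words, b=new_words).get_opcodes():
        if tag != "replace":
            continue
        old_phrase = " ".join(old_words[i1:i2]).strip(".")
        new_phrase = " ".join(new_words[j1:j2]).strip(".")
        # Both sub-segments need a known translation; otherwise the
        # match cannot be repaired (the question posed above).
        if old_phrase in glossary and new_phrase in glossary:
            repaired = repaired.replace(glossary[old_phrase], glossary[new_phrase])
    return repaired

glossary = {"grey": "graue", "black": "schwarze",
            "living room": "Wohnzimmer", "hallway": "Gang"}
print(repair_fuzzy_match(
    "The black cat usually sleeps in the hallway.",
    "The grey cat usually sleeps in the living room.",
    "Die graue Katze schläft gewöhnlich im Wohnzimmer.",
    glossary))
# → Die schwarze Katze schläft gewöhnlich im Gang.
```

Note that the repair silently does nothing when a needed translation is missing from the glossary, which is exactly the gap the following slides address.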
Advanced Leveraging
• Bilingual concordance search for grey:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
  EN: Mary has bought a new pair of grey running shoes.
  DE: Maria hat ein neues Paar graue Laufschuhe gekauft.
  EN: This article is also available in grey.
  DE: Dieser Artikel ist auch in grau erhältlich.
Advanced Leveraging
• Statistically infer translations from the TM
• Compare all of the German translations and suggest one or more probable translations (e.g. graue, grau)
• Requires:
  – Large TMs with many examples
  – Consistent translations in the TM
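A toy version of this statistical inference, assuming nothing more than a list of (source, target) segment pairs. The scoring heuristic is invented for illustration and is far cruder than the word-alignment models real systems use:

```python
# Toy statistical inference of sub-segment translations from a TM.
# The scoring heuristic is invented for illustration; real systems use
# proper word alignment over much larger corpora.
from collections import Counter

def infer_translations(term, tm, top_n=2):
    """Rank target-side words by how strongly they co-occur with `term`."""
    with_term, without_term = Counter(), Counter()
    for src, tgt in tm:
        bucket = with_term if term in src.lower() else without_term
        bucket.update(tgt.lower().strip(".").split())
    # Reward frequency alongside the term, penalise frequency elsewhere.
    scores = {w: c / (1 + without_term[w]) for w, c in with_term.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

tm = [
    ("The grey cat usually sleeps in the living room.",
     "Die graue Katze schläft gewöhnlich im Wohnzimmer."),
    ("Mary has bought a new pair of grey running shoes.",
     "Maria hat ein neues Paar graue Laufschuhe gekauft."),
    ("This article is also available in grey.",
     "Dieser Artikel ist auch in grau erhältlich."),
    ("The black cat sleeps in the hallway.",
     "Die schwarze Katze schläft im Gang."),
]
print(infer_translations("grey", tm)[0])  # → graue
```

With only four segments, everything below the top candidate is noise (function words co-occur just as strongly as grau), which is precisely why large TMs with consistent translations are listed as requirements.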
Combining TM and MT
• We can use MT as an additional resource for finding the translations needed to repair fuzzy matches
• MT systems often give better results for terms and short phrases than for long sentences
• We approach this combination based on the following premises:
  – A client's own data is considered to be of higher quality and will always have priority over the Machine Translation results
  – A fuzzy match repaired with Machine Translation will usually be better than a normal fuzzy match, and better than an MT result for an entire segment
Combining TM and MT
• We need to translate:
  EN: The black cat usually sleeps in the hallway.
• Our TM contains:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
• Our termbase contains:
  grey → graue
  black → schwarze
  hallway → Gang
Combining TM and MT
• We do not have a translation for living room in our TM or our termbase, so we can request it from the MT system:
  EN: living room
  DE: Wohnzimmer
• Combining the material in our TM, termbase and MT system allows us to perform the appropriate replacements and obtain:
  EN: The black cat usually sleeps in the hallway.
  DE: Die schwarze Katze schläft gewöhnlich im Gang.
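The priority order just described — client termbase and TM-derived pairs first, MT only as a fallback — can be sketched as a simple lookup cascade. `mt_translate` is a stand-in for a request to an external MT engine, not a real API:

```python
# Sketch of the resource cascade: client data (termbase, then sub-segment
# pairs mined from the TM) always outranks MT, which is consulted only for
# sub-segments nothing else can translate. `mt_translate` is a stand-in
# for a request to an external MT engine.
def mt_translate(phrase):
    canned = {"living room": "Wohnzimmer"}  # pretend MT response
    return canned.get(phrase)

def lookup(phrase, termbase, tm_glossary):
    """Translate a sub-segment, preferring the client's own resources."""
    for resource in (termbase, tm_glossary):
        if phrase in resource:
            return resource[phrase]
    return mt_translate(phrase)  # last resort

termbase = {"grey": "graue", "black": "schwarze", "hallway": "Gang"}
tm_glossary = {}  # nothing mined from the TM in this example

print(lookup("black", termbase, tm_glossary))        # → schwarze (termbase)
print(lookup("living room", termbase, tm_glossary))  # → Wohnzimmer (MT fallback)
```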
Current Limitations
• We need to translate:
  EN: The white dog usually sleeps in the living room.
• Our TM contains:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
• Our termbase contains:
  grey cat → graue Katze
Current Limitations
• Asking the MT system for the missing translation, we get:
  EN: white dog
  DE: weißer Hund
• The result of repairing the fuzzy match is:
  EN: The white dog usually sleeps in the living room.
  DE: Die weißer Hund schläft gewöhnlich im Wohnzimmer.
• The article and adjective no longer agree (it should be Der weiße Hund), so some post-editing is still required
Current Limitations
• We need to translate:
  EN: The grey cat often sleeps in the living room.
• Our TM contains:
  EN: The grey cat usually sleeps in the living room.
  DE: Die graue Katze schläft gewöhnlich im Wohnzimmer.
• The translations we get from the MT system are:
  usually → normalerweise
  often → oft
• We cannot repair the fuzzy match because we do not know how usually was translated in the TM segment (as gewöhnlich, not normalerweise), so there is nothing to replace
Future Developments
• Greater integration with the MT engines
  – Access to internal translation candidates:
    EN: usually
    DE: normalerweise, gewöhnlich, sonst, ...
  – Access to internal language models:
    DE: Die weißer Hund (never)
    DE: Der weiße Hund (often)
  – Automatic upload of new TM material to the MT engine so it can be used for retraining in the future
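To make the language-model idea concrete: if an engine exposed its target-language statistics, competing candidate repairs could be scored for fluency and the agreement error from the earlier slide avoided. A toy illustration, with bigram counts invented for the example:

```python
# Toy illustration of the language-model idea: with access to the engine's
# target-language statistics, competing repairs could be scored for fluency.
# The bigram counts below are invented for the example.
BIGRAM_COUNTS = {
    ("der", "weiße"): 118, ("weiße", "hund"): 97,  # fluent combination
    ("die", "weißer"): 0, ("weißer", "hund"): 45,  # "Die weißer Hund": never
}

def fluency(sentence):
    """Sum bigram counts over the sentence as a crude fluency score."""
    words = sentence.lower().strip(".").split()
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))

candidates = ["Die weißer Hund schläft im Wohnzimmer.",
              "Der weiße Hund schläft im Wohnzimmer."]
best = max(candidates, key=fluency)
print(best)  # → Der weiße Hund schläft im Wohnzimmer.
```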
Conclusion
• Traditional segment-level translation reuse has reached its full potential
• ATRIL's Déjà Vu X2 already includes DeepMiner technology that improves productivity by cleverly combining all the approaches we described:
  – (Statistical) Machine Translation
  – Example-Based Machine Translation
  – Advanced Leveraging (sub-segment matching)
Predictive Typing
• Find all sub-segment matches and offer them to the translator as he or she types
• Suggestions are context-sensitive, so there are never too many results to choose from
• Translations are constructed piece by piece from previous texts, guided by the translator
Advanced Predictive Typing
• Advanced Leveraging techniques for statistically inferring sub-segment translations from the TM can be adapted to provide additional predictive typing suggestions
• Translations from MT can be added to the predictive typing mechanism, offering additional suggestions for translations of terms and phrases
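A minimal sketch of the prefix-filtered suggestion mechanism. The candidate pool here is hard-coded; in practice it would be merged from TM sub-segments, statistically inferred pairs and MT output for the current segment:

```python
# Minimal sketch of prefix-filtered predictive typing. The candidate pool
# would in practice be merged from TM sub-segments, statistically inferred
# pairs and MT output for the segment being translated.
def suggest(typed_prefix, candidates, limit=3):
    """Offer completions that match what the translator has typed so far."""
    prefix = typed_prefix.lower()
    return [c for c in candidates if c.lower().startswith(prefix)][:limit]

candidates = ["graue Katze", "Gang", "gewöhnlich", "Wohnzimmer"]
print(suggest("g", candidates))   # → ['graue Katze', 'Gang', 'gewöhnlich']
print(suggest("wo", candidates))  # → ['Wohnzimmer']
```

Filtering by the typed prefix is what keeps the suggestion list short and context-sensitive, as the previous slide notes.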
MT Integrations in Déjà Vu X2
• Systran Enterprise Server
• Google Translate
• Microsoft Translator
• PROMT Translation Server
• iTranslate4.eu