Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages — Application to Statistical Machine Translation — part 2

Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, sentiment analysis, etc.; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, and this offers opportunities for reusing bi-texts.

We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii).

We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-lingual morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder.

Finally, we observe that for closely-related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
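To make the character-level idea concrete, here is a minimal sketch (not from the talk) of the representation such models typically use, mirroring the Macedonian/Bulgarian example in the transcript below: every character becomes a token and "_" marks the original word boundaries, so a standard phrase-based SMT system can be trained directly on characters.

```python
def to_char_level(sentence):
    """Turn a sentence into a character-level 'sentence': every character
    becomes its own token, and '_' marks the original word boundaries."""
    return " ".join(sentence.replace(" ", "_"))

def from_char_level(tokens):
    """Invert the transformation after character-level decoding."""
    return tokens.replace(" ", "").replace("_", " ")

s = "Никога не съм спала цял сезон ."
assert from_char_level(to_char_level(s)) == s
print(to_char_level(s))
# Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
```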

  • Slide notes
  • Statistical machine translation (or SMT) systems learn how to translate from large sentence-aligned bilingual corpora of human-generated translations.
    Such corpora are commonly called bi-texts.
    A well-known problem with the current SMT systems is that collecting sufficiently large training bi-texts is very hard, so most languages in the world are still resource-poor for SMT.
    To solve this problem, we want to adapt a bi-text of a resource-rich language to improve machine translation for a related resource-poor language.
  • Let’s start with an introduction.
  • We want to reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
    Why does this work?
    Because many resource-poor languages are related to some resource-rich languages.
    And related languages often share overlapping vocabulary and cognates.
    They often have similar word order and syntax.
  • There are many resource-rich and resource-poor languages which are closely related.
    [CLICK]
    In our work, we focus on the pair Malay and Indonesian.
    We also show the applicability of our method to another language pair: Bulgarian and Macedonian.
  • Here is our main focus: improving Indonesian-English SMT using an additional Malay-English bi-text.
  • Malay and Indonesian are closely related languages.
    A native speaker of Indonesian can understand Malay texts, and vice versa.
    Here are two example sentence pairs which show the similarity: about 50 percent of the words overlap.
    So, we can train an SMT system on one language and apply it to the other directly: there are matching words and short phrases.
  • We asked a native Indonesian speaker to adapt the same Malay sentences into Indonesian while preserving as many Malay words as possible.
    As a result, the overlap reached 75 percent.
    [CLICK]
    Our goal is to do this automatically: adapt Malay to look like Indonesian.
    Then, we can use this adapted bi-text to improve Indonesian-English SMT.
  • Suppose we have a small Indonesian-English bi-text, which is resource-poor.
    And we also have another large bi-text for Malay-English, which is resource-rich.
    Our method has two steps.
    [CLICK]
    The first step is bi-text adaptation.
    We adapt the Malay side of the Malay-English bi-text to look like Indonesian.
    [CLICK]
    The second step is bi-text combination.
    We try to combine the adapted bi-text with the original small Indonesian-English bi-text in order to improve Indonesian-English SMT.
    [CLICK]
    Note that we have no Malay-Indonesian bi-text.
  • The first step is bi-text adaptation: adapting a Malay-English bi-text to “Indonesian”-English.
  • Given a Malay-English sentence pair
    We first adapt the Malay sentence to look like “Indonesian” using word-level and phrase-level paraphrases, and cross-lingual morphology.
    Then, we pair the adapted “Indonesian” sentence with the English sentence of the Malay-English sentence pair.
    [CLICK]
    Finally, we obtain a new “Indonesian”-English sentence pair.
  • For example,
    given a Malay sentence,
    [CLICK]
    we generate a confusion network.
    In the confusion network, each Malay word is augmented with multiple Indonesian word-level paraphrases.
    [CLICK]
    Then we decode this confusion network using a large Indonesian language model.
    This yields a ranked list of adapted “Indonesian” sentences (a minimal decoding sketch follows below).
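A minimal, illustrative sketch of this decoding step, assuming a toy bigram table in place of the large Indonesian LM; all probabilities and candidates below are invented for illustration (the real system decodes the network with a full n-gram LM):

```python
import math

# Toy bigram "language model" log-probabilities; a real system queries a
# large Indonesian n-gram LM (these values are purely illustrative).
BIGRAM_LOGPROB = {
    ("<s>", "pdb"): -0.4, ("<s>", "kdnk"): -2.0,
    ("pdb", "malaysia"): -0.3, ("kdnk", "malaysia"): -1.5,
    ("malaysia", "akan"): -0.5, ("malaysia", "dijangka"): -1.8,
    ("akan", "mencapai"): -0.4, ("dijangka", "mencapai"): -1.2,
}
DEFAULT_LOGPROB = -3.0

# One slot per Malay word; each slot holds Indonesian candidates with
# pivot-based conditional probabilities (illustrative numbers).
network = [
    [("pdb", 0.6), ("kdnk", 0.4)],
    [("malaysia", 1.0)],
    [("akan", 0.5), ("dijangka", 0.5)],
    [("mencapai", 0.7), ("cecah", 0.3)],
]

def decode(network, beam_size=3):
    """Beam search over the confusion network: the score of a hypothesis
    combines the paraphrase probabilities with the LM log-probability."""
    beam = [(["<s>"], 0.0)]
    for slot in network:
        expanded = []
        for tokens, score in beam:
            for cand, p in slot:
                lm = BIGRAM_LOGPROB.get((tokens[-1], cand), DEFAULT_LOGPROB)
                expanded.append((tokens + [cand], score + math.log(p) + lm))
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    return [(" ".join(t[1:]), s) for t, s in beam]

for sentence, score in decode(network):
    print(f"{score:8.3f}  {sentence}")
```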
  • After that, we pair each adapted “Indonesian” sentence with the English counterpart of the Malay sentence in the Malay-English bi-text.
    [CLICK]
    We thus end up with a synthetic “Indonesian”–English bi-text.
  • How do we find the Indonesian word-level paraphrases for a Malay word?
    We use pivoting over English to induce potential Indonesian paraphrases for a given Malay word.
    First, we generate separate word alignments for the Indonesian–English and the Malay–English bi-texts.
    If a Malay word ML3 and an Indonesian word IN3 are both aligned to the same English word EN3,
    [CLICK]
    then, we consider the Indonesian word IN3 as a potential translation option for the Malay word ML3.
    [CLICK]
    Each translation pair is associated with a conditional probability in the confusion network.
    This probability is estimated by pivoting over English, as sketched below.
    [CLICK]
    Note that we have no Malay-Indonesian bi-text, so we pivot over English to get Malay-Indonesian translation pairs.
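A small sketch of the pivoting estimate with invented toy tables; it computes the standard marginalization Pr(in | ml) = Σ_en Pr(in | en) · Pr(en | ml):

```python
from collections import defaultdict

# Toy lexical translation tables (all probabilities are illustrative):
# p_en_given_ml[ml][en] = Pr(en | ml), from the ML-EN word alignments;
# p_in_given_en[en][in] = Pr(in | en), from the IN-EN word alignments.
p_en_given_ml = {"cecah": {"reach": 0.8, "hit": 0.2}}
p_in_given_en = {"reach": {"mencapai": 0.7, "menjangkau": 0.3},
                 "hit":   {"memukul": 0.9, "mencapai": 0.1}}

def pivot(ml_word):
    """Pr(in | ml) = sum over en of Pr(in | en) * Pr(en | ml)."""
    scores = defaultdict(float)
    for en, p_en in p_en_given_ml.get(ml_word, {}).items():
        for in_word, p_in in p_in_given_en.get(en, {}).items():
            scores[in_word] += p_in * p_en
    return dict(scores)

print(pivot("cecah"))
# approximately {'mencapai': 0.58, 'menjangkau': 0.24, 'memukul': 0.18}
```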
  • Since the Indonesian-English bi-text is small, its word alignments are unreliable.
    As a result, we get bad Malay-Indonesian paraphrases from these word alignments.
    [CLICK]
    We try to improve the word alignments using the Malay-English bi-text. Since Malay and Indonesian share some vocabulary, we combine the Indonesian-English and Malay-English bi-texts to carry out word alignment. As a result, we obtain improved Indonesian-English word alignments.
    When concatenating the two bi-texts, we include multiple copies of the small Indonesian-English bi-text, because the Malay-English bi-text is much larger (see the sketch below).
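A sketch of this balanced concatenation for alignment, under the assumption that k is simply the rounded size ratio of the two bi-texts (the transcript says k ≈ |ML-EN| / |IN-EN|):

```python
def concat_for_alignment(in_en, ml_en):
    """Build the corpus passed to the word aligner: k copies of the small
    IN-EN bi-text plus the large ML-EN bi-text, with k chosen so the two
    parts are roughly balanced."""
    k = max(1, round(len(ml_en) / len(in_en)))
    return in_en * k + ml_en, k

def keep_first_copy_alignments(alignments, in_en_size):
    """After alignment, keep only the links for the first copy of IN-EN."""
    return alignments[:in_en_size]

# Hypothetical sizes: 10 IN-EN pairs and 100 ML-EN pairs give k = 10.
in_en = [("saya", "I")] * 10
ml_en = [("saya", "I")] * 100
corpus, k = concat_for_alignment(in_en, ml_en)
print(k, len(corpus))  # 10, 200
```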
  • The second issue is that, since the Indonesian-English bi-text is small, the Indonesian word-level paraphrases for a Malay word are restricted to its small Indonesian vocabulary.
    [CLICK]
    To enlarge this vocabulary, we use cross-lingual morphological variants.
    Now let me explain how we add cross-lingual morphological variants to a confusion network.
    If the input Malay sentence has the word seperminuman, we first find its lemma minum, and then determine all Indonesian words sharing the same lemma.
    These Indonesian words are considered the cross-lingual morphological variants of the Malay word.
    [CLICK]
    Note that the Indonesian morphological variants come from a large monolingual Indonesian text, so they include Indonesian words that do not occur in the small Indonesian-English bi-text (see the sketch below).
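A sketch of the variant lookup; the toy lemmatizer below is a crude stand-in for real Malay/Indonesian morphological analysis, and the word list is a tiny sample:

```python
from collections import defaultdict

def build_variant_index(indonesian_words, lemmatize):
    """Index a large monolingual Indonesian vocabulary by lemma."""
    index = defaultdict(set)
    for w in indonesian_words:
        index[lemmatize(w)].add(w)
    return index

def variants_for_malay_word(ml_word, ml_lemmatize, index):
    """All Indonesian words sharing the Malay word's lemma become
    candidate adaptations (added to the confusion network)."""
    return sorted(index.get(ml_lemmatize(ml_word), set()))

def toy_lemmatize(word):
    """Strip a few common affixes; real morphology is much richer."""
    for prefix in ("seper", "me", "di", "pe", "ter"):
        if word.startswith(prefix):
            word = word[len(prefix):]
    for suffix in ("kan", "nya", "an"):
        if word.endswith(suffix):
            word = word[: -len(suffix)]
    return word

index = build_variant_index(
    ["minum", "minuman", "peminum", "diminum"], toy_lemmatize)
print(variants_for_malay_word("seperminuman", toy_lemmatize, index))
# ['diminum', 'minum', 'minuman', 'peminum']
```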
  • Word-level pivoting ignores context.
    It relies on the Indonesian language model to make the right contextual choice.
    [CLICK]
    We also try to model the context more directly by generating adaptation options at the phrase level using pivoted phrase tables.
    We use standard phrase-based SMT techniques to build two separate phrase tables for the Indonesian–English and the Malay–English bi-texts.
    Then we pivot the two phrase tables over English phrases.
    The obtained pivoted phrase table is used to adapt Malay to Indonesian.
    We also add cross-lingual morphological variants to enlarge the Indonesian vocabulary.
    [CLICK]
    As a result, we can model the context better, using both the Indonesian language model and the pivoted phrases.
    Another advantage is that phrases allow more word operations, such as insertion and deletion. The pivoted phrase probabilities are computed by the marginalization shown below.
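Phrase-level pivoting scores a Malay-to-Indonesian phrase pair by the standard marginalization over shared English phrases; slide 8 of the transcript shows a concrete instance of the same computation (0.8 × 0.7 = 0.56). In the notation below, $\bar{s}$, $\bar{e}$, and $\bar{t}$ denote a Malay, English, and Indonesian phrase, respectively:

```latex
\Pr(\bar{t} \mid \bar{s}) \;=\; \sum_{\bar{e}} \Pr(\bar{t} \mid \bar{e}) \, \Pr(\bar{e} \mid \bar{s})
```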
  • Recall that the second step of our method is bi-text combination.
  • We combine the original small Indonesian–English bi-text with the adapted “Indonesian”–English bi-text in three ways:
    [CLICK]
    The first way is to simply concatenate the two bi-texts as the training bi-text.
    In this way, we assume the two bi-texts have the same quality.
    [CLICK]
    The second way is called balanced concatenation.
    Since the adapted bi-text is much larger than the original Indonesian-English bi-text, the adapted bi-text will dominate the concatenation.
    To overcome this problem, we repeat the smaller Indonesian-English bi-text enough times that the two bi-texts are of comparable size before concatenation.
    [CLICK]
    Finally, we experiment with a method for combining phrase tables proposed in previous work by Nakov and Ng, which improves the word alignments and then combines the phrase tables with extra indicator features (sketched below).
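Sketches of the two table-combination operations described in the transcript (slides 35 and 36): linear interpolation, and Merge with indicator features. The toy tables and probabilities are invented:

```python
def interpolate_tables(t_orig, t_extra, alpha=0.9):
    """Linear interpolation of two phrase tables (cf. transcript slide 35):
    Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s),
    with alpha tuned on a development set."""
    keys = set(t_orig) | set(t_extra)
    return {k: alpha * t_orig.get(k, 0.0) + (1 - alpha) * t_extra.get(k, 0.0)
            for k in keys}

def merge_tables(t_orig, t_extra):
    """The Merge strategy (cf. transcript slide 36): keep all entries from
    t_orig, add entries from t_extra that are missing, and attach indicator
    features F1 (kept from t_orig), F2 (added from t_extra), F3 (in both);
    the feature weights are then set with MERT."""
    merged = {k: (p, 1, 0, int(k in t_extra)) for k, p in t_orig.items()}
    for k, p in t_extra.items():
        merged.setdefault(k, (p, 0, 1, 0))
    return merged

# Toy tables keyed by (source phrase, target phrase); probabilities invented.
t_orig = {("pada tahun", "in the year"): 0.6}
t_extra = {("pada tahun", "in the year"): 0.4, ("cecah", "reach"): 0.8}
print(interpolate_tables(t_orig, t_extra))
print(merge_tables(t_orig, t_extra))
```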
  • I will now present our experiments.
  • In our experiments, we use the following datasets.
    For Indonesian–English: we have a small training bi-text, a development set, and also a test set.
    We also use a large Malay–English bi-text, which is then adapted into Indonesian-English.
  • We have carried out two kinds of experiments:
    The first kind is called isolated experiments.
    In isolated experiments, we only use the adapted bi-text but not the original Indonesian-English bi-text.
    These experiments provide a direct comparison to using the original bi-text.
    The green bars show the two baseline systems.
    Although the original Malay-English bi-text is about 10 times bigger than the original Indonesian-English bi-text, training on the Malay-English bi-text is much worse than training on the small Indonesian-English bi-text.
    This shows the existence of important differences between Malay and Indonesian.
    Using our method, we can see that word-level paraphrasing improves by 5 BLEU points over the original Malay-English baseline.
    And it improves by close to one BLEU point over the original Indonesian-English baseline.
    By adding cross-lingual morphological variants to word-level paraphrasing, we get about half a BLEU point of improvement. This confirms that the cross-lingual morphological variants are actually effective.
    As we discussed before, phrase-level paraphrasing models context better, so it yields a larger improvement.
    Finally, we use the system combination method, MEMT, to combine the best word-level paraphrasing system and the best phrase-level paraphrasing system, and it yields even further improvements.
    This shows that the two kinds of paraphrasing methods are actually complementary.
  • The second kind is the combined experiments.
    In these experiments, we combine the adapted bi-text with the original Indonesian-English bi-text using the three bi-text combination methods.
    We again get improvements using both the word-level and the phrase-level paraphrasing methods, consistent with the isolated experiments.
    One interesting finding is that, with our method, the results of the three bi-text combination methods do not differ as much as they do for the baselines.
  • To summarize, this graph shows the overall improvements that we obtain in our experiments.
    The first three bars are the baselines using existing methods,
    And the fourth one is our best isolated system, which improves about 1 BLEU point over the baselines.
    The last one is the best combined system, and it gives us 1.5 BLEU point improvement over the baselines.
  • We have also applied our method to other languages.
    We try to improve Macedonian-English SMT by adapting a Bulgarian-English bi-text.
    We get similar results.
    This confirms the applicability of our method to other language pairs.
  • While Indonesian is closely related to Malay, there are also some false friends: words that the two languages share but that have very different meanings.
    That’s why we paraphrase all the words in our experiments.
  • We asked a native Indonesian speaker who does not speak Malay to judge whether our adapted “Indonesian” sentences are more understandable to him than the original Malay input.
    It turned out that, to this speaker, the two were about equally understandable.
    The adapted sentences did work better than the original Malay sentences in our experiments.
    We think there can be two reasons for this:
    The first one is that SMT systems can tolerate noisy training data;
    The second reason may be that the judgments were at the sentence level, while phrases are sub-sentential: a “bad” sentence can still contain many good phrases.
  • We also tried to adapt Indonesian to Malay, and then use a Malay-English translation system to translate the adapted Malay sentences to English.
    However, the results turned out to be worse than adapting Malay to Indonesian.
  • Next I will conclude our work.
  • In summary, to improve resource-poor machine translation, we adapt bi-texts for a related resource-rich language, using confusion networks, word-level and phrase-level paraphrasing, and morphological analysis.
    We achieved sizable improvements over the baselines.

    In the future, we would like to add more word operations, for example, splitting and merging words.
    We also want to better integrate our word-level and phrase-level paraphrasing methods.
    Lastly, we want to apply our methods to other languages and NLP problems.
  • Some related work.
  • There is some related work on translating texts between related languages, which resembles our bi-text adaptation step.
    Most of it uses rule-based translation systems, whereas our method is statistical and thus largely language-independent.
  • Transcript

    • 1. Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation Preslav Nakov, Qatar Computing Research Institute (collaborators: Jorg Tiedemann, Pidong Wang, Hwee Tou Ng) Yandex seminar August 13, 2014, Moscow, Russia
    • 2. Plan: Part I, Introduction to Statistical Machine Translation; Part II, Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation; Part III, Further Discussion on SMT.
    • 3. The Problem: Lack of Resources
    • 4. Overview. Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts). Problem: such training bi-texts do not exist for most languages. Idea: adapt a bi-text for a related resource-rich language.
    • 5. Building an SMT System for a New Language Pair. In theory, it requires only a few hours or days; in practice, large bi-texts are needed, and these are only available for the official languages of the UN (Arabic, Chinese, English, French, Russian, Spanish), the official languages of the EU, and some other languages. Most of the 6,500+ world languages remain resource-poor from an SMT viewpoint. This number is even more striking if we consider language pairs. Even resource-rich language pairs become resource-poor in new domains.
    • 6. Most Language Pairs Have Little Resources. Zipfian distribution of language resources.
    • 7. Building a Bi-text for SMT. Small bi-texts are relatively easy to build; large bi-texts are hard to get, e.g., because of copyright. Sources: parliament debates and legislation, both national (Canada, Hong Kong) and international (United Nations; European Union: Europarl, Acquis). Becoming an official language of the EU is an easy recipe for getting rich in bi-texts quickly. Not all languages are so “lucky”, but many can still benefit.
    • 8. How Google/Bing (Yandex?) Translate Resource-Poor Languages. How do we translate from Russian to Malay? Use triangulation: cascaded translation (Utiyama & Isahara, 2007; Koehn & al., 2009), i.e., Russian->English->Malay; or phrase table pivoting (Cohn & Lapata, 2007; Wu & Wang, 2007): from “рамочное соглашение ||| framework agreement ||| 0.7” and “perjanjian kerangka kerja ||| framework agreement ||| 0.8” we get “рамочное соглашение ||| perjanjian kerangka kerja ||| 0.56”.
    • 9. What if We Want to Translate into English? Idea: reuse bi-texts from related resource-rich languages to build an improved SMT system for a related resource-poor language. NOTE 1: this is NOT triangulation; we focus on translation into English, e.g., Indonesian-English using Malay-English, rather than pivoting Indonesian->Malay->English. NOTE 2: we exploit the fact that the source languages are related.
    • 10. Resource-poor vs. Resource-rich
    • 11. Resource-rich vs. Resource-poor Languages. Related EU and non-EU/unofficial languages: Swedish-Norwegian, Bulgarian-Macedonian, Irish-Scottish Gaelic, Standard German-Swiss German. Related EU languages: Spanish-Catalan, Czech-Slovak. Related languages outside Europe: Russian-Ukrainian, MSA-Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi), Hindi-Urdu, Turkish-Azerbaijani, Malay-Indonesian. We will explore these pairs.
    • 12. Motivation. Related languages have overlapping vocabulary (cognates) and similar word order and syntax.
    • 13. Improving Indonesian-English SMT Using Malay-English (Source Language Adaptation for Resource-Poor Machine Translation; Wang, Nakov & Ng)
    • 14. Malay vs. Indonesian (from Article 1 of the Universal Declaration of Human Rights; ~50% exact word overlap). Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan. Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama. Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
    • 15. Malay Can Look “More Indonesian”… Post-edited Malay to look “Indonesian” (by an Indonesian speaker); ~75% exact word overlap. Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak. Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan. Post-edited “Indonesian”: Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama. Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan. We attempt to do this automatically: adapt Malay to look Indonesian; then use it to improve SMT.
    • 16. Method at a Glance. Step 1 (Adaptation): adapt the Malay side of the Malay-English bi-text to “Indonesian”, yielding an “Indonesian”-English bi-text. Step 2 (Combination): combine the small Indonesian-English bi-text with the “Indonesian”-English bi-text. Note that we have no Malay-Indonesian bi-text!
    • 17. Step 1: Adapting Malay-English to “Indonesian”-English (Wang, Nakov & Ng)
    • 18. Word-Level Bi-text Adaptation: Overview. Given a Malay-English sentence pair: 1. Adapt the Malay sentence to “Indonesian” using word-level paraphrases, phrase-level paraphrases, and cross-lingual morphology. 2. Pair the adapted “Indonesian” with the English side of the Malay-English sentence pair. Thus, we generate a new “Indonesian”-English sentence pair. (Source Language Adaptation for Resource-Poor Machine Translation, EMNLP 2012; Pidong Wang, Preslav Nakov, Hwee Tou Ng)
    • 19. Word-Level Bi-text Adaptation: Motivation. In many cases, word-level substitutions are enough to adapt Malay to Indonesian (train): KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010. -> PDB Malaysia akan mencapai 8 persen pada tahun 2010. (Malaysia’s GDP is expected to reach 8 per cent in 2010.)
    • 20. Word-Level Bi-text Adaptation: Overview. Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010. Decode using a large Indonesian LM; the probabilities come from pivoting over English.
    • 21. Word-Level Bi-text Adaptation: Overview. Pair each adapted sentence with the English counterpart (Malaysia’s GDP is expected to reach 8 per cent in 2010.). Thus, we generate a new “Indonesian”-English bi-text.
    • 22. Word-Level Adaptation: Extracting Paraphrases. Indonesian translations for a Malay word: pivoting over English, with weights from the ML-EN and IN-EN word alignments. (Diagram: a Malay word and an Indonesian word aligned to the same English word in the two bi-texts.) Note: we have no Malay-Indonesian bi-text, so we pivot.
    • 23. Word-Level Adaptation: Issue 1. The IN-EN bi-text is small, thus the IN-EN word alignments are unreliable, yielding bad ML-IN paraphrases. Solution: improve the IN-EN alignments using the ML-EN bi-text: concatenate IN-EN × k + ML-EN, where k ≈ |ML-EN| / |IN-EN|; run word alignment; keep the alignments for one copy of IN-EN only. This works because of the cognates between Malay and Indonesian.
    • 24. Word-Level Adaptation: Issue 2. The IN-EN bi-text is small, thus the IN vocabulary for the ML-IN paraphrases is small. Solution: add cross-lingual morphological variants. Given the ML word seperminuman, find its ML lemma minum and propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makanan-minuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum. Note: the IN variants come from a larger monolingual IN text.
    • 25. Word-Level Adaptation: Issue 3. Word-level pivoting ignores context, relies on the LM, and cannot drop/insert/merge/split/reorder words. Solution: phrase-level pivoting: build ML-EN and EN-IN phrase tables; induce a ML-IN phrase table by pivoting over EN; adapt the ML side of ML-EN to get an “IN”-EN bi-text, using the Indonesian LM and n-best “IN” as before; also use cross-lingual morphological variants. This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
    • 26. Step 2: Combining IN-EN + “IN”-EN (Wang, Nakov & Ng)
    • 27. Combining the IN-EN and “IN”-EN Bi-texts. Simple concatenation: IN-EN + “IN”-EN. Balanced concatenation: IN-EN × k + “IN”-EN. Sophisticated phrase table combination: improved word alignments for IN-EN, plus phrase table combination with extra features. (Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages, EMNLP 2009, Preslav Nakov, Hwee Tou Ng; Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages, JAIR 2012, Preslav Nakov, Hwee Tou Ng)
    • 28. Bi-text Combination Strategies: concatenating bi-texts, merging phrase tables, combined method.
    • 29. Bi-text Combination Strategies: concatenating bi-texts.
    • 30. Concatenating Bi-texts (1). Summary: concatenate X1-Y and X2-Y. Advantages: improved word alignments (e.g., for rare words); more translation options and fewer unknown words; useful non-compositional phrases (improved fluency); phrases with words from X2 that do not exist in X1 are ignored. Disadvantages: X2-Y will dominate, since it is larger; the translation probabilities are messed up; phrases from X1-Y and X2-Y cannot be distinguished.
    • 31. Concatenating Bi-texts (2). Concat×k: concatenate k copies of the original and one copy of the additional training bi-text. Concat×k:align: 1. Concatenate k copies of the original and one copy of the additional bi-text. 2. Generate word alignments. 3. Truncate them, keeping only the alignments for one copy of the original bi-text. 4. Build a phrase table. 5. Tune the system using MERT. The value of k is optimized on the development dataset.
    • 32. Bi-text Combination Strategies: merging phrase tables.
    • 33. Merging Phrase Tables (1). Summary: build two separate phrase tables, then (a) use them together, (b) merge them, or (c) interpolate them. Advantages: phrases from X1-Y and X2-Y can be distinguished; the larger bi-text X2-Y does not dominate X1-Y; more translation options; the probabilities are combined in a more principled manner. Disadvantage: improved word alignments are not possible.
    • 34. Merging Phrase Tables (2). Two-tables: build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).
    • 35. Merging Phrase Tables (3). Interpolation: build two separate phrase tables, Torig and Textra, and combine them using linear interpolation: Pr(e|s) = α Pr_orig(e|s) + (1 - α) Pr_extra(e|s). The value of α is optimized on a development dataset.
    • 36. Merging Phrase Tables (4). Merge: 1. Build separate phrase tables Torig and Textra. 2. Keep all entries from Torig. 3. Add those entries from Textra that are not in Torig. 4. Add extra features: F1 is 1 if the entry came from Torig, 0 otherwise; F2 is 1 if the entry came from Textra, 0 otherwise; F3 is 1 if the entry was in both tables, 0 otherwise. The feature weights are set using MERT, and the number of features is optimized on the development set.
    • 37. Bi-text Combination Strategies: combined method.
    • 38. Combined Method. Use Merge to combine the phrase tables for concat×k:align (as Torig) and for concat×1 (as Textra). Two parameters to tune: the number of repetitions k, and the number of extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2 and F3. Benefits: improved word alignments, improved lexical coverage, and phrases distinguished by their source table.
    • 39. Experiments & Evaluation (Wang, Nakov & Ng)
    • 40. Data. Translation data (for IN-EN): IN2EN-train 0.9M, IN2EN-dev 37K, IN2EN-test 37K, EN monolingual 5M. Adaptation data (for ML-EN to “IN”-EN): ML2EN 8.6M, IN monolingual 20M (tokens).
    • 41. Isolated Experiments: Training on “IN”-EN Only. (BLEU chart: ML2EN baseline 14.50, IN2EN baseline 18.67, word paraphrasing 19.50, word paraphrasing + morphology 20.06, phrase paraphrasing 20.63, phrase paraphrasing + morphology 20.89, system combination 21.24.) System combination using MEMT (Heafield and Lavie, 2010). Wang, Nakov & Ng (EMNLP 2012)
    • 42. Combined Experiments: Training on IN-EN + “IN”-EN. (BLEU chart, for simple concatenation, balanced concatenation, and phrase table combination: ML2EN baseline 18.49, 19.79, 20.10; with the adapted bi-text 21.55, 21.64, 21.62.) Wang, Nakov & Ng (EMNLP 2012)
    • 43. Experiments: Improvements. (BLEU chart: ML2EN baseline 14.50, IN2EN baseline 18.67, phrase table combination 20.10, best isolated system 21.24, best combined system 21.64.) Wang, Nakov & Ng (EMNLP 2012)
    • 44. Application to Other Languages & Domains. Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text: adapt BG-EN (11.5M words) to “MK”-EN (1.2M words); OPUS movie subtitles. (BLEU: BG2EN (A) 27.33; word paraphrasing + morphology (B) 27.97; phrase paraphrasing + morphology (C) 28.38; system combination of A+B+C 29.05.)
    • 45. Analysis
    • 46. Paraphrasing Non-Indonesian Malay Words Only. So, we do need to paraphrase all words. Wang, Nakov & Ng (EMNLP 2012)
    • 47. Human Judgments. Is the adapted sentence better Indonesian than the original Malay sentence? 100 random sentences. Morphology yields worse top-3 adaptations but better phrase tables, due to coverage. Wang, Nakov & Ng (EMNLP 2012)
    • 48. Reverse Adaptation. Idea: adapt the dev/test Indonesian input to “Malay”, then translate with a Malay-English system. Input to SMT: a “Malay” lattice, or the 1-best “Malay” sentence from the lattice. Adapting dev/test is worse than adapting the training bi-text, so we need both the n-best list and the LM. Wang, Nakov & Ng (EMNLP 2012)
    • 49. A Specialized Decoder (Instead of Moses)
    • 50. Beam-Search Text Rewriting Decoder: The Algorithm. (A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation, NAACL 2013; Pidong Wang, Hwee Tou Ng)
    • 51. Beam-Search Text Rewriting Decoder: An Example (Twitter Normalization). Wang & Ng (NAACL 2013)
    • 52. Hypothesis Producers: word-level mapping, phrase-level mapping, cross-lingual morphology mapping, Indonesian LM, word penalty (target), Malay word penalty (source), phrase count. Wang & Ng (NAACL 2013)
    • 53. Moses vs. the Specialized Decoder. Decoding level: phrase vs. sentence. Features: Moses vs. richer, e.g., Malay word penalty, word-level + phrase-level (potentially, manual rules). Cross-lingual variants: input lattice vs. feature function. Wang & Ng (NAACL 2013)
    • 54. Moses vs. a Specialized Decoder: Isolated “IN”-EN Experiments. (BLEU, for WordPar, WordPar+Morph, PhrasePar, PhrasePar+Morph, and system combination: Moses 19.50, 20.06, 20.63, 20.89, 21.24; specialized decoder 20.39, 20.46, 20.85, 21.07, 21.76.)
    • 55. Moses vs. a Specialized Decoder: Combining IN-EN and “IN”-EN. (BLEU, for simple concatenation, balanced concatenation, and phrase table combination: ML2EN baseline 18.49, 19.79, 20.10; Moses 21.55, 21.64, 21.62; specialized decoder 21.74, 21.81, 22.03.)
    • 56. Experiments: Improvements. (BLEU: ML2EN baseline 14.50; IN2EN baseline 18.67; phrase table combination (Moses) 20.10; best isolated system (Moses) 21.24; best combined system (Moses) 21.64; best combination (DD, the specialized decoder) 22.03.)
    • 57. MK->EN: Adapting BG-EN to “MK”-EN. (BLEU: BG2EN 27.33; word paraphrasing + morphology (Moses) 27.97; phrase paraphrasing + morphology (Moses) 28.38; combination (Moses) 29.05; combination (DD) 29.35.)
    • 58. Transliteration
    • 59. Spanish vs. Portuguese (from Article 1 of the Universal Declaration of Human Rights; 17% exact word overlap). Spanish: Todos los seres humanos nacen libres e iguales en dignidad y derechos y, dotados como están de razón y conciencia, deben comportarse fraternalmente los unos con los otros. Portuguese: Todos os seres humanos nascem livres e iguais em dignidade e em direitos. Dotados de razão e de consciência, devem agir uns para com os outros em espírito de fraternidade.
    • 60. Spanish vs. Portuguese (same example): 17% exact word overlap, but 67% approximate word overlap; the actual overlap is even higher.
    • 61. Cognates. Linguistics definition: words derived from a common root, e.g., Latin tu (‘2nd person singular’), Old English thou, French tu, Spanish tú, German du, Greek sú; orthography, phonetics, and semantics are ignored. Computational linguistics definition: words in different languages that are mutual translations and have a similar orthography, e.g., evolution vs. evolución vs. evolução vs. evoluzione; orthography and semantics are important, origin is ignored. Cognates can differ a lot: night vs. nacht vs. nuit vs. notte vs. noite; star vs. estrella vs. stella vs. étoile; arbeit vs. rabota vs. robota (‘work’); father vs. père; head vs. chef.
    • 62. Spelling Differences Between Cognates. Systematic spelling differences (Spanish vs. Portuguese): -nh- vs. -ñ- (senhor vs. señor); -ción vs. -ção (evolución vs. evolução); -é vs. -ei, 1st person singular past (visité vs. visitei); -ó vs. -ou, 3rd person singular past (visitó vs. visitou). Occasional differences: decir vs. dizer (‘to say’); Mario vs. Mário; María vs. Maria. Many of these can be learned automatically.
    • 63. Automatic Transliteration. 1. Extract likely cognates for Portuguese-Spanish. 2. Learn a character-level transliteration model. 3. Transliterate the Portuguese side of pt-en to look like Spanish.
    • 64. Automatic Transliteration (2). Extract pt-es cognates using English (en): 1. Induce pt-es word translation probabilities. 2. Filter out pairs with low translation probability. 3. Filter out pairs with low orthographic similarity, measured via the longest common subsequence, using constants proposed in the literature. (A cognate-filtering sketch follows after the transcript.)
    • 65. SMT-based Transliteration. Train and tune a monotone character-level SMT system, and use it to transliterate the Portuguese side of pt-en.
    • 66. ES->EN: Adapting PT-EN to “ES”-EN, with 10K ES-EN and 1.23M PT-EN. (BLEU for PT-EN, ES-EN, and phrase table combination: original 5.34, 22.87, 24.23; with transliterated PT, 13.79 for PT-EN and 26.24 for the combination.)
    • 67. Transliteration vs. Character-Level Translation
    • 68. Macedonian vs. Bulgarian
    • 69. MK->BG: Transliteration vs. Translation. (BLEU: MK (original) 10.74; MK (simple transliteration) 12.07; MK (cognate transliteration) 22.74; MK-BG (words) 31.10; MK-BG (words + cognate transliteration) 32.19; MK-BG (characters) 32.71; MK-BG (words + cognate transliteration + characters) 33.94.) (Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages, ACL 2012; Preslav Nakov, Jorg Tiedemann)
    • 70. Character-Level SMT. MK: Никогаш не сум преспала цела сезона. BG: Никога не съм спала цял сезон. Character-level representation: MK: Н и к о г а ш _ н е _ с у м _ п р е с п а л а _ ц е л а _ с е з о н а _ . BG: Н и к о г а _ н е _ с ъ м _ с п а л а _ ц я л _ с е з о н _ .
    • 71. Character-Level Phrase Pairs. Can cover word prefixes/suffixes, entire words, word sequences, and combinations thereof. Max phrase length = 10; LM order = 10.
    • 72. MK->BG: The Impact of Data Size
    • 73. Slavic Languages in Europe
    • 74. (Chart: MK paired with BG, SR, CZ, SL.)
    • 75. (Chart: MK vs. SR, SL, CZ.)
    • 76. Pivoting
    • 77. MK->EN: Pivoting over BG. Macedonian: Никогаш не сум преспала цела сезона. Bulgarian: Никога не съм спала цял сезон. English: I’ve never slept for an entire season. For related languages: subword transformations, character-level translation.
    • 78. MK->EN: Pivoting over BG. (Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets, RANLP 2013; Jorg Tiedemann, Preslav Nakov)
    • 79. MK->EN: Using Synthetic “MK”-EN Bi-texts. Translate Bulgarian to Macedonian in a BG-XX corpus. All synthetic data combined (+mk-en): 36.69 BLEU. Tiedemann & Nakov (RANLP 2013)
    • 80. Conclusion
    • 81. Conclusion & Future Work. Adapt bi-texts for related resource-rich languages, using confusion networks, word-level and phrase-level paraphrasing, and cross-lingual morphological analysis. Character-level models: translation, transliteration, pivoting vs. synthetic data. Future work: other languages and NLP problems; robustness to noise and domain shift. Thank you!
    • 82. Related Work
    • 83. Related Work (1). Machine translation between related languages, e.g., Cantonese-Mandarin (Zhang, 1998), Czech-Slovak (Hajic & al., 2000), Turkish-Crimean Tatar (Altintas & Cicekli, 2002), Irish-Scottish Gaelic (Scannell, 2006), Bulgarian-Macedonian (Nakov & Tiedemann, 2012). We do not translate (no training data); we “adapt”.
    • 84. Related Work (2). Adapting dialects to the standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011): manual rules and/or language-specific tools. Normalizing tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011): informal text (spelling, abbreviations, slang), same language.
    • 85. Related Work (3). Adapting Brazilian to European Portuguese (Marujo & al., 2011): rule-based, language-dependent, tiny improvements for SMT. Reusing bi-texts between related languages (Nakov & Ng, 2009): no language adaptation (just transliteration). Cascaded/pivoted translation (Utiyama & Isahara, 2007; Cohn & Lapata, 2007; Wu & Wang, 2009): translating poor->rich->X requires an additional poor-rich bi-text, and it does not use the poor-rich similarity.
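As referenced from slide 64, here is a hedged sketch of cognate filtering by translation probability and orthographic similarity. The longest-common-subsequence ratio is a standard similarity measure for this task; the thresholds and word pairs below are illustrative stand-ins for the “constants proposed in the literature”:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two words."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Longest common subsequence ratio: orthographic similarity in [0, 1]."""
    return lcs_len(a, b) / max(len(a), len(b))

def likely_cognates(pairs, p_threshold=0.01, lcsr_threshold=0.58):
    """Keep (pt, es, prob) word pairs that pass both the translation
    probability filter and the orthographic similarity filter."""
    return [(pt, es) for pt, es, p in pairs
            if p >= p_threshold and lcsr(pt, es) >= lcsr_threshold]

# Invented pivot-induced translation probabilities for illustration.
pairs = [("diretor", "director", 0.41), ("dizer", "decir", 0.35),
         ("casa", "perro", 0.02)]
print(likely_cognates(pairs))
# [('diretor', 'director'), ('dizer', 'decir')]
```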
