Effects of Lemmatization on Czech-English Statistical MT

Ashley Gill                      Parinita
University of Washington         University of Washington
Seattle, WA                      Seattle, WA

Abstract

The focus of this paper is to see whether lemmatization affects Czech-to-English phrase-based machine translation. We vary the translation scenario by lemmatizing different parts of speech in Czech. Experimental results demonstrate significant improvement of translation quality in terms of BLEU. We then propose a simple post-processing step to further improve the translation output. The approach is applicable to language pairs that differ in their morphological richness.

1 Introduction

Czech, as a Slavic language, is highly inflectional and has almost free word order. Most of the functions expressed in Czech by endings (inflections) are rendered in English by word order and function words. This causes fewer instances of each surface form of a word (prefix+stem+suffix) to occur in the corpus. This phenomenon, called data sparsity, is one of the factors that degrade statistical machine translation (SMT). Research in SMT increasingly makes use of linguistic analysis to improve performance. By including abstract categories, such as lemmas and parts of speech (POS), in the models, it is argued, systems can become better at handling sentences for which training data at the word level is sparse. Existing work has shown that using morpho-syntactic information is an effective remedy for data sparseness. We present a phrase-based statistical machine translation approach that uses linguistic analysis in the preprocessing phase. The linguistic analysis consists of morphological transformation by lemmatization, trying various combinations to measure the improvement in word alignment and the reduction in vocabulary size.

2 System Overview

Previous work has shown that the most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006). In our experiments we aim to improve word alignments between Czech and English by splitting off the inflection of verbs and nouns, increasing the frequency of the stem words and hence improving alignments. Czech is a pro-drop language: pronouns representing the subject are usually left out, but the morphology of the verb indicates explicitly which pronoun was meant. By lemmatizing verbs only, we hope to reduce this source of misalignment.

For this task we ran four experiments. The baseline experiment was carried out with no changes. We ran three further experiments to measure the effect of lemmatization: ALemma, where all words in the corpus were lemmatized; NLemma, where only words tagged as nouns were lemmatized; and VLemma, where only words tagged as verbs were lemmatized.

3 Components

3.1 Corpus

Training data is taken from the News Commentary corpus. The released data is not tokenized and includes sentences of any length (including empty sentences); all data is in Unicode (UTF-8) format. To tune our system during development, we used a development set of 1057 sentences (News Commentary, nc-dev2007). The data is provided in raw text format. To test our system during development, we used 2007 sentences (News Commentary, nc-test2007).

We tokenized the corpus and lowercased it, and we removed sentences longer than 40 words, as required by GIZA++.
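The cleaning step described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used (data preparation was done with the Moses toolkit's scripts): the function name and the naive regex tokenizer are assumptions, while the 40-word limit comes from the GIZA++ requirement mentioned above.

```python
import re

MAX_LEN = 40  # sentence-length limit required by GIZA++

def clean_pair(src_lines, tgt_lines, max_len=MAX_LEN):
    """Lowercase, crudely tokenize, and drop empty or over-long
    sentence pairs, keeping the parallel corpus aligned."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        # naive tokenization: split punctuation off words (the real
        # pipeline used the Moses tokenizer, not this regex)
        src_toks = re.findall(r"\w+|[^\w\s]", src.lower())
        tgt_toks = re.findall(r"\w+|[^\w\s]", tgt.lower())
        if not src_toks or not tgt_toks:
            continue  # skip empty sentences
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # skip pairs GIZA++ cannot align
        kept.append((" ".join(src_toks), " ".join(tgt_toks)))
    return kept
```

Because both sides of a pair are dropped together, the source and target files stay line-aligned after cleaning, which is what Table 1 reports as the reduction in sentence count.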
Table 1: Corpus size before and after cleaning

                      Input sentences   Output sentences
  original baseline        70048             62610

The Moses toolkit (Koehn et al., 2007) implements phrase-based machine translation and operates only on the word level. This toolkit was used for preparing the data, building the language model, training the translation model, tuning, running the system on the development test set, and evaluation.

3.2 Lemmatizer

For lemmatization we used the Free Morphology (FM) tool (Hajič, 2001). FM is a pair of universal (i.e., language-independent) morphology tools for the analysis and generation of word forms in inflective languages. It comes with a frequency-based, high-coverage Czech dictionary. It takes a text file as input and returns the output in csts-like SGML markup, with one token per line.

Example input: Prezident rezignoval na svou funkci.

Output:

  <csts>
  <f cap>Prezident<MMl>prezident<MMt>NNMS1-----A----
  <f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
  <f>na<MMl>na<MMt>RR--4----------<MMt>RR--6----------
  <f>svou<MMl>svůj-1_^(přivlast.)<MMt>P8FS4---------1<MMt>P8FS7---------1
  <f>funkci<MMl>funkce<MMt>NNFS3-----A----<MMt>NNFS4-----A----<MMt>NNFS6-----A----
  <D>
  <d>.<MMl>.<MMt>Z:-------------
  </csts>

Table 2: Sample FM analyzer output

The FM can work for other morphologically rich inflective languages that can be described by segmenting a word form into two parts: a root and an ending. Even where this is not linguistically justified, many phenomena that would normally break this simple rule can be made to work in this framework. Special provision is made in the code for up to two "inflectional" prefixes which might both be present in some word forms; such prefixes are found in many Slavic languages, such as Czech, Slovak and Polish.

Czech positional morphology (Hajič, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. Categories that are not relevant for a given lemma (e.g. tense for nouns) are assigned a special value. We made use of this positional information from the lemmatizer output to re-create our lemmatized corpus for the different experiments.

3.3 Preprocessor

For each experiment, the training and test datasets are run through the FM lemmatizer. Because of the style of the FM output, we first implemented a simple sentence delimiter, "*", which does not occur naturally in the corpus; this lets us determine where to place line breaks between sentences without disrupting the naturally occurring sentence punctuation. Next, for each word in the FM file, depending on the particular experiment, we decide whether to use the original word or the lemma. For the ALemma experiment, we used the lemma instead of the original word for every word in the FM output file. For the NLemma experiment, we used the lemma only if the first position of the FM tag is "N" (denoting a noun), and for the VLemma experiment only if the first position is "V" (denoting a verb). After the preprocessing step was complete, we saved the outputs in UTF-8 format, to ensure compatibility with GIZA++ and Moses.

3.4 Post-processing the corpus

Lemmatization increased the number of words per sentence after preprocessing was complete. We again removed sentences longer than 40 words, which reduced the corpus size substantially, so a comparison with the original baseline would not be a fair one. To get a better picture, we used a smaller subset of the unlemmatized corpus and treated it as the baseline system. This does not compare the same sentences for word alignment, but it gives a fair comparison of the alignments obtained from corpora of similar size under different settings.
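The per-word decision in the preprocessor, i.e. keep the surface form or substitute the FM lemma depending on the experiment and on the first (POS) position of the positional tag, can be sketched as follows. The function names are hypothetical and the parsing of the csts-like markup is omitted; we also assume the lemma string is stripped of FM's technical suffixes, which are visible in Table 2 (e.g. "rezignovat_:T", "svůj-1_^(přivlast.)").

```python
def choose_token(form, lemma, tag, mode):
    """Return the lemma or the original form for one token, depending
    on the experiment (mode in {"alemma", "nlemma", "vlemma"}).
    `tag` is the FM positional tag; its first position is the POS."""
    if mode == "alemma":
        return lemma
    if mode == "nlemma" and tag.startswith("N"):  # noun
        return lemma
    if mode == "vlemma" and tag.startswith("V"):  # verb
        return lemma
    return form

def lemmatize_line(fm_tokens, mode):
    """fm_tokens: list of (form, lemma, tag) triples parsed from the
    csts-like FM output (parsing itself omitted here)."""
    out = []
    for form, lemma, tag in fm_tokens:
        bare = lemma.split("_")[0].split("-")[0]  # strip FM lemma suffixes
        out.append(choose_token(form, bare, tag, mode))
    return " ".join(out)
```

A real implementation would also have to handle FM's ambiguous analyses (multiple <MMt> tags per token, as in Table 2); here we assume one tag has already been selected per token.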
Table 3: Experiment corpus before and after cleaning

              Input sentences   Output sentences
  baseline         35000             14453
  ALemma           70048             13136
  NLemma           70048             15737
  VLemma           70048             20686

The output sentences were then used for building the language models and for training. Lemmatization reduced data sparsity, as has been shown in several papers before; the improvements we saw for our data set are listed in Table 4.

Table 4: Reduction in data sparseness

              Total number   Vocabulary        Singletons
              of words       (unique words)    (words occurring once)
  Baseline      364762          23368              12258
  ALemma        336507          13502               6560
  NLemma        398979          17333               8772
  VLemma        537715          21475              10136

4 Results

The BLEU scores show remarkable improvement on the lemmatized corpus. ALemma almost doubles the baseline score. Lemmatizing only nouns increases the score further, but the best BLEU score is obtained when we lemmatize only the verbs: roughly three times the baseline score. Lemmatizing verbs is useful not only for improving the BLEU score but also for the readability of the translation. Once the verbs are aligned correctly, the nouns are mostly the only words that remain to be translated. Thus, after the VLemma translation is complete, a simple post-processing step that replaces the nouns using a dictionary can improve the translations further. Table 5 shows the improvement in BLEU scores for the various experiments we carried out.

Table 5: BLEU scores

  Experiment          BLEU
  BASELINE            4.24, 27.9/9.3/2.2/0.7 (BP=0.931, ratio=0.933, hyp_len=46470, ref_len=49805)
  ALEMMA              8.60, 36.4/13.7/5.2/2.1 (BP=1.000, ratio=1.177, hyp_len=58645, ref_len=49805)
  NLEMMA              10.09, 40.0/15.7/6.2/2.7 (BP=1.000, ratio=1.108, hyp_len=55174, ref_len=49805)
  VLEMMA              13.06, 44.1/19.1/8.5/4.1 (BP=1.000, ratio=1.017, hyp_len=50652, ref_len=49805)
  original baseline   18.89, 53.0/27.0/14.1/7.9 (BP=0.946, ratio=0.947, hyp_len=47182, ref_len=49805)
  (full corpus)

The BLEU scores are in agreement with the decoded translations; the improvement in the translated text is clearly visible in Table 6.

  BASELINE: rasově rozdělená europe typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch . italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné averze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

  ALemma: rasově , divided europe in fact , european the extreme right is its racism and that imigrace is the question in their political of would . italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this is an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

  NLemma: race-specific divided europe in fact the extreme right is its racism and that applied to the immigration question in their political of europe . indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

  VLemma: race-specific divided europe in fact the extreme right is its rasismus and that use immigration is the question in their political favor . indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Table 6: Sample outputs

We propose that lemmatizing a specific part of speech improves the word alignments, and that two approaches to a complete translation can be derived from this fact. One is to create a pipeline of translations: for example, the output of NLemma becomes the source language of VLemma, and translation is then done from the "verb-lemmatized NLemma output" to English. The second approach is to apply a dictionary or lexicon to the translation output of the lemmatized corpus. Looking at the VLemma output in Table 6, we can apply a simple dictionary such as the one in Table 7.

Table 7: Sample dictionary

  Czech       English       POS
  rasismus    racialism     noun
  rasismus    racism        noun
  penova      foam          noun
  averze      abhorrence    noun
  averze      disliking     noun
  averze      loathing      noun
  imigrant    immigrant     noun

The output we obtain then has the previously unresolved nouns translated correctly:

  VLEMMA-DICT: race-specific divided europe in fact the extreme right is its rasismus [racism] and that use immigration is the question in their political favor . indeed , the lega nord nizozemsk vlaams blockade , the french still penova [foam] national fronts - all of this is happening parties or movement would be held and of from the common averse [loathing] towards immigrant [immigrants] and pushing of makes it easier to this view , to question the immigrants .

Table 8: Post-processed output

5 Limitations

The results are compared only using BLEU scores. Specific phenomena such as the pronoun dropping that occurs in Czech are not tested for translation accuracy. The preprocessing steps implemented were focused more on BLEU score improvement than on improvement in the actual translations. While BLEU is a standard evaluation metric, other phenomena could have undergone human cross-evaluation for a better understanding of the improvement in the results. The experiments also do not cover the effect of the morphology of the target language on the translations; experiments on the effect of lemmatizing both the source and the target language on word alignment could be carried out (Zhang et al., 2007).

6 Future Paths

Due to time constraints we could not carry out experiments to measure whether a pipeline of lemmatization improves translation quality. In future we would like to compare both the dictionary method and the pipeline method using BLEU scores and a human evaluation metric. We would also like to add more syntactic information to improve word reordering and language modelling, and to carry out experiments on other languages.

7 Conclusion

We have studied the effect of lemmatizing different parts of speech on the word alignment for Czech-to-English SMT. We found that lemmatization of verbs yields the largest improvement in BLEU scores. We conclude with an approach that improves the translations by lemmatizing verbs to improve alignments and then replacing the unresolved nouns and adjectives using a Czech-English dictionary. This approach is applicable to translations from any morphologically rich language to a simpler one.

References

Adria de Gispert, Deepa Gupta, Maja Popovic, Patrik Lambert, Jose B. Marino, Marcello Federico, Hermann Ney and Rafael Banchs. 2006. Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Advances in Natural Language Processing, Vol. 4139:368-379.

Bettina Schrader. 2004. Improving word alignment quality using linguistic knowledge. In Workshop proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 46-49.

Maria Holmqvist, Sara Stymne and Lars Ahrenberg. 2007. Getting to know Moses: Initial experiments on German-English factored translation. In Proceedings of the ACL Second Workshop on Statistical Machine Translation, 181-184.

Ondřej Bojar and Magdalena Prokopová. 2006. Czech-English Word Alignment. In Proceedings of LREC'06, 1236-1239, ELRA.
Ondřej Bojar, Evgeny Matusov, and Hermann Ney. 2006. Czech-English Phrase-Based Machine Translation. Institute of Formal and Applied Linguistics, Czech Republic.

Ondřej Bojar, David Mareček, Václav Novák, Martin Popel, Jan Ptáček, Jan Rouš and Zdeněk Žabokrtský. 2009. English-Czech MT in 2009. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 125-129.

Ondřej Bojar and Jan Hajič. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, 143-146.

Sharon Goldwater and David McClosky. 2005. Improving Statistical MT through Morphological Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 676-683.

Ruiqiang Zhang and Eiichiro Sumita. 2007. Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation. National Institute of Information and Communications Technology, Spoken Language Communication Research Laboratories, Japan.

Hua Wu, Haifeng Wang and Zhanyi Liu. 2006. Boosting Statistical Word Alignment Using Labeled and Unlabeled Data. Toshiba Research and Development Center, China.

Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan.

Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 486-494, Suntec, Singapore.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Demo and Poster Sessions, 177-180, Prague, Czech Republic.