Effects of Lemmatization on Czech-English Statistical MT
Ashley Gill
University of Washington
Seattle, WA

Parinita
University of Washington
Seattle, WA
Abstract

The focus of this paper is to examine whether lemmatization affects Czech-to-English phrase-based machine translation. We vary the translation scenario by lemmatizing different parts of speech in Czech. Experimental results demonstrate a significant improvement in translation quality in terms of BLEU. We then propose a simple step to further improve the translation output. This approach is applicable to language pairs that differ in their morphological richness.

1 Introduction

Czech, as a Slavic language, is highly inflectional and has an almost free word order. Most of the functions expressed in Czech as endings (inflections) are rendered in English by word order and function words. As a result, fewer instances of each surface form of a word (prefix+stem+suffix) occur in the corpus. This phenomenon, called data sparsity, is one of the factors that degrade statistical machine translation (SMT). Research in SMT increasingly makes use of linguistic analysis in order to improve performance. By including abstract categories such as lemmas and parts of speech (POS) in the models, it is argued, systems can become better at handling sentences for which training data at the word level is sparse. Existing work has shown that using morpho-syntactic information is an effective remedy for data sparseness. We present a phrase-based statistical machine translation approach that uses linguistic analysis in the preprocessing phase. The analysis consists of morphological transformation by lemmatization, trying various combinations to measure the improvement in word alignment and the reduction in vocabulary size.

2 System Overview

Previous work has shown that the most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006). In our experiments we aim to improve word alignments between Czech and English by splitting off the inflection of verbs and nouns, thereby increasing the frequency of the stem words and hence improving the alignments. Czech is also a pro-drop language: pronouns representing the subject are usually left out, but the morphology of the verb indicates explicitly which pronoun was meant. By lemmatizing verbs only, we hope to improve these alignments.

For this task we ran four experiments. The baseline experiment was carried out with no changes. We then ran three experiments with lemmatization to observe its effect: ALemma, where all words in the corpus were lemmatized; NLemma, where only words tagged as nouns were lemmatized; and VLemma, where only words tagged as verbs were lemmatized.

3 Components

Training data is taken from the newly released News Commentary corpus. The released data is not tokenized and includes sentences of any length (including empty sentences). All data is in Unicode (UTF-8) format. To tune our system during development, we used a development set of 1057 sentences (News Commentary, nc-dev2007). The data is provided in raw text format. To test our system during development, we used 2007 sentences (News Commentary, nc-test2007).

We tokenized the corpus and lowercased it. We removed sentences longer than 40 words, as required by GIZA++.
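The cleaning step above can be sketched as follows. This is a minimal Python sketch assuming whitespace-tokenized input; the function and variable names are ours, standing in for the standard corpus-cleaning scripts.

```python
# Minimal sketch of the corpus cleaning described above: lowercase both
# sides of the parallel corpus and drop empty pairs or pairs with a side
# longer than 40 tokens, as required by GIZA++. Names are illustrative.

def clean_parallel_corpus(src_lines, tgt_lines, max_len=40):
    """Lowercase and length-filter a whitespace-tokenized parallel corpus."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_toks = src.lower().split()
        tgt_toks = tgt.lower().split()
        if not src_toks or not tgt_toks:
            continue  # drop empty sentences
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue  # drop sentences longer than max_len words
        kept.append((" ".join(src_toks), " ".join(tgt_toks)))
    return kept

pairs = clean_parallel_corpus(
    ["Prezident rezignoval na svou funkci ."],
    ["The president resigned from his post ."])
```

Both sides of a pair are dropped together, so the parallel corpus stays aligned after filtering.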
                    Input sentences   Output sentences
 original baseline      70048              62610

      Table 1: Corpus size before and after cleaning

The Moses toolkit (Koehn et al., 2007) implements phrase-based machine translation and operates only on the word level. We used this toolkit for preparing the data, building the language model, training the model, tuning, running the system on the development test set, and for evaluation.

3.2 Lemmatizer

For lemmatization we used the Free Morphology (FM) tool (Hajic, 2001). FM is a pair of universal (i.e., language-independent) morphology tools (FMAnalyze.pl, FMGenerate.pl) for the analysis and generation of word forms in inflective languages. It comes with a frequency-based, high-coverage Czech dictionary. It takes a text file as input and returns the output in csts-like SGML markup, with one token per line.

Input: Prezident rezignoval na svou funkci.
Output:
<csts>
<f>Prezident<MMl>prezident<MMt>NNMS1-----A----
<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---
<f>na<MMl>na<MMt>RR--4----------
<f>svou<MMl>svůj-
<f>funkci<MMl>funkce<MMt>NNFS3-----A-----
<D>
<d>.<MMl>.<MMt>Z:-------------
</csts>

Table 2: Sample FM analyzer output

FM can work for other morphologically rich inflective languages that can be described by segmenting a word form into two parts: a root and an ending. Even if this is not quite justified linguistically, many phenomena that would normally break this simple rule can be made to fit the framework. Special provision is made in the code for up to two "inflectional" prefixes, which might both be present in some word forms. Such prefixes are found in many Slavic languages, such as Czech, Slovak, and Polish.

Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. Categories that are not relevant for a given lemma (e.g., tense for nouns) are assigned a special value. We made use of this positional information from the lemmatizer output to re-create our lemmatized corpus for the different experiments.

3.3 PreProcessor

For each experiment, the training and testing datasets need to be run through the FM lemmatizer tool. Because of the style of the FM output, we first inserted a simple sentence delimiter, "*", which does not occur naturally in the corpus. This lets us determine where to place line breaks between sentences without disrupting the naturally occurring sentence punctuation. Next, for each word in the FM file, depending on the particular experiment, we decide whether to use the original word or the lemma. For the ALemma experiment, we used the lemma instead of the original word for every word in the FM output file. For the NLemma experiment, we used the lemma only if the first position of the FM output markup is "N" (denoting a noun), and for the VLemma experiment only if the first position is "V" (denoting a verb). After the preprocessing step was complete, we saved the outputs in UTF-8 format to ensure compatibility with GIZA++ and Moses.

3.4 Post-Preprocessing of the Corpus

Lemmatization increased the number of words per sentence. After preprocessing was complete, we removed sentences longer than 40 words, which reduced the corpus size substantially and made the comparison with the original baseline unfair. To get a better picture, we used a smaller subset of the unlemmatized corpus and treated it as the baseline system. This does not compare the same sentences for word alignment, but it gives a fair comparison of the alignments for corpora of similar size under different settings.
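The per-word decision made by the PreProcessor in Section 3.3 can be sketched in Python. The regular expression assumes token lines of the shape shown in Table 2; the helper name and the stripping of FM's technical lemma suffixes (such as "_:T") are our illustrative simplifications.

```python
import re

# One FM output line looks like "<f>word<MMl>lemma<MMt>TAG" (cf. Table 2).
# The first character of TAG is the part of speech in the positional tagset.
TOKEN_RE = re.compile(
    r"<f>(?P<form>[^<]+)<MMl>(?P<lemma>[^<]+)(?:<MMt>(?P<tag>\S+))?")

def choose_token(line, experiment):
    """Return the surface form or the lemma for one FM token line."""
    m = TOKEN_RE.match(line)
    if not m:
        return None  # not a token line (e.g. <csts> or <D>)
    form, lemma = m.group("form"), m.group("lemma")
    tag = m.group("tag") or ""
    lemma = lemma.split("_")[0]  # strip technical suffixes such as "_:T"
    if experiment == "ALemma":
        return lemma                              # lemmatize every word
    if experiment == "NLemma" and tag.startswith("N"):
        return lemma                              # nouns only
    if experiment == "VLemma" and tag.startswith("V"):
        return lemma                              # verbs only
    return form                                   # otherwise keep the word

line = "<f>rezignoval<MMl>rezignovat_:T<MMt>VpYS---XR-AA---"
```

For this line, VLemma and ALemma yield the lemma "rezignovat", while NLemma keeps the surface form "rezignoval".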
4 Results

The BLEU scores show a remarkable improvement for the lemmatized corpus. ALemma almost doubles the baseline score. Lemmatizing only nouns increases the score further, but the best BLEU scores are obtained when we lemmatize only the verbs: roughly three times the baseline score. Lemmatizing verbs is useful not only for improving the BLEU scores but also for improving the readability of the translations. Once the verbs are aligned correctly, the nouns are mostly the only words that remain to be translated. Thus, after the VLemma translation is complete, a simple post-processing step that replaces the nouns using a dictionary can improve the translations further. Table 5 shows the improvement in BLEU scores for the various experiments we carried out.

             Input sentences   Output sentences
 baseline        35000              14453
 ALEMMA          70048              13136
 NLEMMA          70048              15737
 VLEMMA          70048              20686

 Table 3: Experiment corpus before and after cleaning

The output sentences were then used for building the language models and for training.

We saw the improvements in data sparsity from lemmatization that have been reported in various papers before. The improvements for our data set are listed in Table 4.

             Total number   Vocabulary       Singletons
             of words       (unique words)   (words occurring once)
 Baseline       364762          23368             12258
 ALEMMA         336507          13502              6560
 NLEMMA         398979          17333              8772
 VLEMMA         537715          21475             10136

        Table 4: Reduction in data sparseness

 Experiment   BLEU    n-gram precisions
 BASELINE      4.24   27.9/9.3/2.2/0.7    (BP=0.931, ratio=0.933)
 ALEMMA        8.60   36.4/13.7/5.2/2.1   (hyp_len=58645, ref_len=49805)
 NLEMMA       10.09   40.0/15.7/6.2/2.7   (BP=1.000, ratio=1.108)
 VLEMMA       13.06   44.1/19.1/8.5/4.1   (BP=1.000, ratio=1.017)
 original     18.89   53.0/27.0/14.1/7.9  (BP=0.946, ratio=0.947,
 baseline                                  hyp_len=47182, ref_len=49805)
 (full corpus)

                Table 5: BLEU scores

The BLEU scores are in agreement with the decoded translations; a remarkable improvement in the translated text can be seen in Table 6.

BASELINE: rasově rozdělená europe
typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch .
italská lega nord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné averze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .

ALemma: rasov~R , divided europe
in fact , european the extreme right is its racism and that imigra~R is the question in their political of would .
italy ' s nord lego , the dutch , vlaams blockade , the french has come . as to how souë jmen . it ' s rule of money ' s administration national fronts - all of this ii s an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .

NLemma: race-specific divided europe
in fact the extreme right is its racism and that applied to the immigration question in their political of europe .
indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .

VLemma: race-specific divided europe
in fact the extreme right is its rasismus and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .

Table 6: Sample outputs
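The dictionary-based post-processing step suggested above can be sketched as follows. The toy dictionary mirrors the kind of entries shown in Table 7; the function name and the single-sense lookup are our illustrative simplifications (a real dictionary would need to choose among multiple senses of a word such as "averze").

```python
# Sketch of the proposed post-processing: Czech tokens left untranslated
# in the VLemma output are replaced via a Czech-English dictionary.
# Toy dictionary; entry choice among multiple senses is naively fixed here.

CZ_EN_DICT = {
    "rasismus": "racism",
    "averze": "loathing",
    "imigrant": "immigrant",
}

def apply_dictionary(translation, dictionary):
    """Replace every token found in the dictionary; keep the rest unchanged."""
    return " ".join(dictionary.get(tok, tok) for tok in translation.split())

out = apply_dictionary("common averze towards imigrant", CZ_EN_DICT)
```

Applied to the VLemma output, this turns the untranslated nouns into their English equivalents while leaving already-translated words untouched.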
We propose that lemmatizing a specific part of speech improves the word alignments, and that two possible approaches toward a complete translation can be derived from this fact. One is to create a pipeline of translations: for example, the output of NLemma becomes the source language of VLemma, and translation is done from a "verb-lemmatized NLemma output" to English. The second approach is to apply a dictionary or lexicon to the translation output of the lemmatized corpus. Looking at the VLemma output in Table 6, we can apply a simple dictionary such as the one in Table 7.

 Czech      English      POS
 rasismus   racialism    noun
 rasismus   racism       noun
 penova     foam         noun
 averze     abhorrence   noun
 averze     disliking    noun
 averze     loathing     noun
 imigrant   immigrant    noun

 Table 7: Sample dictionary

The output we get will then have the previously unresolved nouns translated correctly.

VLEMMA-DICT: race-specific divided europe
in fact the extreme right is its rasismus [racism] and that use immigration is the question in their political favor .
indeed , the lega nord nizozemsk vlaams blockade , the french still penova [foam] national fronts - all of this is happening parties or movement would be held and of from the common averze [loathing] towards imigrant [immigrants] and pushing of makes it easier to this view , to question the immigrants .

Table 8: Post-processed output

5 Limitations

The results are compared only using BLEU scores. Specific phenomena, such as the pronoun dropping that occurs in Czech, are not tested for translation accuracy. The preprocessing steps implemented were focused more on BLEU score improvement than on improvement in the actual translations. While BLEU is a standard evaluation metric, these phenomena could have undergone human evaluation for a better understanding of the improvement in the results.

The experiments also do not cover the effect of the morphology of the target language on the translations. Experiments such as measuring the effect of lemmatizing both the source and the target language on the word alignments could be carried out (Zhang and Sumita, 2007).

6 Future Paths

Due to time constraints, we could not carry out experiments to measure whether a pipeline of lemmatization improves translation quality. In the future we would like to compare the dictionary method and the pipeline method using BLEU scores and human evaluation metrics. We would also like to add more syntactic information to improve word reordering and language modelling, and to carry out experiments on other languages.

7 Conclusion

We have studied the effect of lemmatizing different parts of speech on improving the word alignment for Czech-to-English SMT. We found that lemmatization of verbs yields the largest improvement in BLEU scores. We concluded with an approach that improves the translations by lemmatizing verbs to improve the alignments and then replacing the unresolved nouns and adjectives using a Czech-English dictionary. This approach is applicable to translation from any morphologically rich language into a morphologically simpler one.

References

Adria de Gispert, Deepa Gupta, Maja Popovic, Patrik Lambert, Jose B. Marino, Marcello Federico, Hermann Ney and Rafael Banchs. 2006. Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Advances in Natural Language Processing, Vol. 4139:368-379.

Bettina Schrader. 2004. Improving word alignment quality using linguistic knowledge. In Workshop proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 46-49.

Maria Holmqvist, Sara Stymne and Lars Ahrenberg. 2007. Getting to know Moses: Initial experiments on German-English factored translation. In Proceedings of the ACL Second Workshop on Statistical Machine Translation, 181-184.

Ondřej Bojar and Magdalena Prokopová. 2006. Czech-English Word Alignment. In Proceedings of LREC'06, pp. 1236-1239, ELRA.
Ondřej Bojar, Evgeny Matusov, and Hermann Ney. 2006. Czech-English Phrase-Based Machine Translation. Institute of Formal and Applied Linguistics, Czech Republic.
Ondřej Bojar, David Mareček, Václav Novák, Martin Popel, Jan Ptáček, Jan Rouš and Zdeněk Žabokrtský. 2009. English-Czech MT in 2008. In Proceedings of the Fourth Workshop on Statistical Machine Translation, 125-129.
Ondřej Bojar and Jan Hajič. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. In Proceedings of the Third Workshop on Statistical Machine Translation, 143-146.
Sharon Goldwater and David McClosky. 2005. Improving Statistical MT through Morphological Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 676-683.
Ruiqiang Zhang and Eiichiro Sumita. 2007. Boosting Statistical Machine Translation by Lemmatization and Linear Interpolation. National Institute of Information and Communications Technology, Spoken Language Communication Research Laboratories.
Hua Wu, Haifeng Wang and Zhanyi Liu. 2006. Boosting Statistical Word Alignment Using Labeled and Unlabeled Data. Toshiba Research and Development Center, China.
Dan Gusfield. 1997. Algorithms on Strings, Trees and Sequences. Cambridge University Press.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, Michigan, Association for Computational Linguistics.
Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 486-494, Suntec, Singapore, August. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pp. 177-180, Association for Computational Linguistics, Prague, Czech Republic.