Lemmatizer czechtoenglish ml

Published in: Technology, Business

  • Use morphology to improve translation from a morphologically rich language to English.
  • System Overview
  • The number of output sentences is comparable after cleaning the corpus to a 40-token limit per sentence.
  • Outputs: improvement can be seen across the experiments. The only words left unaligned in the VLEMMA output are the nouns (shown in bold). Word alignments improved, but what about syntax and sentence order?
  • A simple dictionary can be added to improve the VLEMMA translations. Once the verbs are aligned and translated correctly, we can apply a simple Czech-English dictionary and substitute the correct translations for the nouns and adjectives.

    1. Effects of Lemmatization on Czech-English Statistical MT
       Ashley Gill, University of Washington, Seattle, WA, gillak@u.washington.edu
       Parinita, University of Washington, Seattle, WA, parinita@u.washington.edu
    2. Motivation
       Morphologically rich language -> English
       Source (Czech)
         • Grammatical functions are expressed as endings (inflections)
         • Each surface form of a word (prefix + stem + suffix) occurs fewer times in the corpus, which causes data sparsity
         • Free word order
       Target (English)
         • Word order
         • Function words
       Goal
         • Improve word alignments
       Approach
         • Analyze surface word forms into lemma and morphology, e.g. car +plural
         • Translate lemma and morphology separately
         • Generate the target surface form
         • Experiment with the different parts of speech (POS)
    6. Experiments
       The most problematic parts of speech in Czech-English translation are nouns and verbs (Bojar and Prokopová, 2006).
         • Baseline: no changes
         • ALemma: all words were lemmatized
         • NLemma: only nouns were lemmatized
         • VLemma: only verbs were lemmatized
    7. System Overview
       Source Corpus -> Lemmatizer (Baseline, ALemma, NLemma, VLemma) -> Lemmatized Source Corpus -> Moses Toolkit -> Target Translation
    8. Lemmatizer
       The Free Morphology (FM) tool (Hajic 2001):
         • universal (i.e., language-independent) morphology tool (FMAnalyze.pl)
         • analyzes word forms for inflective languages
         • includes a frequency-based, high-coverage Czech dictionary
       Czech positional morphology (Hajic, 2000) uses morphological tags consisting of 12 actively used positions, each stating the value of one morphological category. We used the tags for nouns and verbs.
    9. Preprocessing
       The output from the FM has one token per line, with no markup for sentence boundaries.
       We inserted a simple sentence delimiter, "*", into the corpus (it does not occur naturally in the corpus).
       For each word in the FM file:
         • ALemma experiment: use the lemma instead of the original word
         • NLemma experiment: use the lemma only if the first position of the FM output markup is "N" (denoting a noun)
         • VLemma experiment: use the lemma only if the first position of the FM output markup is "V" (denoting a verb)
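The preprocessing rules above can be sketched in a few lines of Python. The slides do not show the actual FM output layout, so the (surface form, lemma, tag) triples, the abbreviated tags, and the function names below are all assumptions made for illustration; only the "*" delimiter and the "N"/"V" first-position test come from the slide.

```python
from typing import Iterable, Iterator, List, Tuple

SENTENCE_DELIMITER = "*"  # marker the slide describes inserting between sentences

# Assumed, simplified view of one FM output line: (surface form, lemma, tag).
Token = Tuple[str, str, str]


def choose_form(form: str, lemma: str, tag: str, mode: str) -> str:
    """Return the lemma or the surface form, depending on the experiment."""
    if mode == "alemma":
        return lemma
    if mode == "nlemma" and tag.startswith("N"):
        return lemma
    if mode == "vlemma" and tag.startswith("V"):
        return lemma
    return form  # baseline, or a word outside the targeted part of speech


def rebuild_sentences(tokens: Iterable[Token], mode: str) -> Iterator[List[str]]:
    """Regroup the one-token-per-line stream into sentences at each delimiter."""
    sentence: List[str] = []
    for form, lemma, tag in tokens:
        if form == SENTENCE_DELIMITER:
            if sentence:
                yield sentence
            sentence = []
            continue
        sentence.append(choose_form(form, lemma, tag, mode))
    if sentence:
        yield sentence


# Toy input with abbreviated, illustrative tags (real positional tags are longer).
example = [
    ("ženy", "žena", "NN"),
    ("viděly", "vidět", "Vp"),
    ("auta", "auto", "NN"),
    ("*", "*", "Z:"),
]
vlemma_sentences = list(rebuild_sentences(example, "vlemma"))
```

With the toy input, the `vlemma` mode replaces only the verb with its lemma and leaves both nouns in their surface forms; the `nlemma` mode does the opposite.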
    10. Corpus
        We used a corpus about half the size of the baseline corpus we compare against.
        35,000 lines, of which 14,453 remain after removing sentences longer than 40 tokens.
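The length-based cleaning step can be illustrated with a minimal sketch. It assumes whitespace tokenization, applies the 40-token limit to both sides of each sentence pair, and also drops empty lines; the slide only states the 40-token limit, so the rest (including the function name) is an assumption.

```python
from typing import List, Tuple


def filter_by_length(pairs: List[Tuple[str, str]], max_tokens: int = 40) -> List[Tuple[str, str]]:
    """Keep sentence pairs in which both sides have 1..max_tokens tokens."""
    kept = []
    for source, target in pairs:
        n_source = len(source.split())
        n_target = len(target.split())
        if 1 <= n_source <= max_tokens and 1 <= n_target <= max_tokens:
            kept.append((source, target))
    return kept
```

Filtering pairs rather than single sentences keeps the Czech and English sides line-aligned, which the alignment step depends on.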
    11. Results
        BLEU scores improved: double for ALEMMA, and triple when only verbs were lemmatized (VLEMMA).
    12. BASELINE OUTPUT: rasově rozdělená europe typickým evropské extrémní of the right , there is a sign of její racism , and that že využívá imigrační otázku in svůj politický prospěch . italská leganord , nizozemský vlaams blocks , francouzská penova defensive on national , this vše are příklady parties či hnutí vzešlých from společné aaverze vůči imigrantům and prosazujících zjednodušující to look at how řešit otázku přistěhovalců .
        ALEMMA OUTPUT: rasov~R , divided europe in fact , european the extreme right is its racism and that using imigra~R is the question in their political of would . italy ' s nordlego , the dutch , vlaams blockade , the french has come . as to how souëjmen . it ' s rule of money ' s administration national fronts - all of this iis an example sides poorer or vze movement , the rise of the common averze against immigrants and pushing the ) , simplifies a view , how many out to question the immigrants .
        NLEMMA OUTPUT: race-specific divided europe in fact the extreme right is its racism and that applied to the immigration question in their political of europe . indeed , the lego , nord , the dutch vlaams bloc , the french still penova combatants national - all of this are examples parties themselves or movements be held and of from the common averze towards immigrants and pushing the the simplest a view , the solution is to question the immigrants .
        VLEMMA OUTPUT: race-specific divided europe in fact the extreme right is its rasismus and that use immigration is the question in their political favor . indeed , the leganordnizozemskvlaamsblockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averze towards imigrant and pushing of makes it easier to this view , to question the immigrants .
    13. VLemma Output: race-specific divided europe in fact the extreme right is its rasismus and that use immigration is the question in their political favor . indeed , the leganordnizozemskvlaams blockade , the french still penova national fronts - all of this is happening parties or movement would be held and of from the common averse towards immigrant and pushing of makes it easier to this view , to question the immigrants .
        Dictionary (applied to the VLemma output)
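A minimal sketch of the dictionary step: tokens that the decoder left in Czech are looked up in a Czech-English dictionary and replaced, and everything else is passed through unchanged. The function name and the three dictionary entries are illustrative assumptions, not the dictionary used in the experiments.

```python
from typing import Dict, List


def apply_dictionary(tokens: List[str], cz_en: Dict[str, str]) -> List[str]:
    """Replace tokens found in the Czech-English dictionary; keep the rest."""
    return [cz_en.get(token, token) for token in tokens]


# Toy dictionary, chosen for illustration only.
CZ_EN = {"rasismus": "racism", "imigrant": "immigrant", "averze": "aversion"}

fragment = "the extreme right is its rasismus".split()
patched = apply_dictionary(fragment, CZ_EN)
```

A plain lookup like this would also rewrite an English token that happens to match a Czech entry, so a fuller version would probably restrict replacement to tokens the decoder marked as untranslated.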
    14. Limitations
          • No use of syntax/POS in sentence reordering
          • Phenomena such as pronoun dropping, which occurs in Czech, were not tested for translation accuracy
          • No human evaluation to better understand the improvement in results
          • Does not cover the effect of target-language morphology on translation (Zhang et al., 2007)
    17. Future Direction
          • Add syntactic information to improve word reordering and language modeling
          • Carry out experiments with other languages
          • Test a pipeline of lemmatization steps to improve word alignment. But what about syntax?
        (Diagram labels: Source Language, VLEMMA output, Lemmatize nouns, English)
