Enriching Transliteration Lexicon Using Automatic Transliteration Extraction


HCSNet Summerfest poster 2007

Published in: Technology

  1. Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
     Sarvnaz Karimi, School of Computer Science and IT, RMIT University
     Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin
     Keywords: Transliteration, Parallel Corpus
  2. Machine Transliteration
     • Machine transliteration transforms a word from a source language into a target language while preserving its pronunciation.
     • Its main applications are machine translation, cross-lingual information retrieval and cross-lingual question answering.
     • Transliteration has been studied in two major areas: transliteration generation and transliteration extraction.
     • Transliteration generation takes a word in the source language (e.g. Sydney in English) and generates its transliteration in the target language (e.g. سیدنی in Persian).
     • Transliteration extraction discovers transliteration pairs (e.g. (Sydney, سیدنی)) in bilingual texts.
  3. Transliteration Extraction So Far
     Transliteration discovery methods in the literature consider:
     • Extraction from a parallel corpus:
       – Statistical methods are beneficial, particularly because sentences and words can be aligned.
       – Yet parallel corpora are hard to find for many less-computerised languages.
     • Extraction from a comparable corpus:
       – More evidence than statistical information alone is required to extract pairs (e.g. temporal or phonetic information, or Web counts).
       – Comparable corpora are easier to construct and find than parallel ones.
     Most studies use a named entity (NE) recogniser to separate proper nouns, which are subject to transliteration, from other words.
  4. Persian and English Transliteration
     • Transliteration generation has been studied using n-gram based and consonant-vowel based approaches.
     • Transliteration extraction has not previously been studied for this language pair, mainly due to the lack of any parallel or comparable corpus.
     • For other language pairs, transliteration extraction has been studied using co-occurrence, temporal information, edit distance measures or phonetic similarity.
     We aim to apply our transliteration generation methods as a basis for this task.
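Of the measures listed above, edit distance is the simplest to illustrate. The following is a minimal sketch, not the poster's method: it computes the standard Levenshtein distance and a length-normalised similarity. The function names are hypothetical, and the sketch assumes both strings are already in the same script (for example, a romanised Persian word compared with an English one).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalisation: 1.0 for identical strings, 0.0 for maximally different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

A candidate pair could then be kept when `similarity` exceeds a threshold; any such threshold would need tuning and is not given in the poster.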
  5. Proposed Method: Application of Transliteration Generation in Extraction
     1. Each document in each language is pre-processed: it is tokenised into a bag of words and stop-words are removed.
     2. Each source-language word is looked up in a dictionary; a word that is not found is an out-of-dictionary word whose transliteration is sought in the target document.
     3. The transliteration system generates a ranked list of possible transliterations for each such source word.
     4. A transliteration that matches a word in the target document forms a potential pair with its source word.
     5. Each pair can be scored based on the rank of the transliteration and the number of times the pair occurs.
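The five steps of the proposed method can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `tokenise`, `extract_pairs`, `top_k` and the `transliterate` callback are hypothetical names, documents are assumed to arrive already paired, and the reciprocal-rank score summed over documents is only one plausible way to combine "rank" and "number of times paired".

```python
import re
from collections import Counter, defaultdict


def tokenise(text, stop_words):
    """Step 1: turn a document into a bag of words, dropping stop-words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def extract_pairs(doc_pairs, dictionary, stop_src, stop_tgt, transliterate, top_k=5):
    """Return candidate (source, target) pairs, sorted by descending score.

    doc_pairs     -- iterable of (source_text, target_text); pairing is assumed
    dictionary    -- set of known source-language words
    transliterate -- callable returning a ranked list of candidate strings
    """
    scores = defaultdict(float)
    for src_text, tgt_text in doc_pairs:
        src_bag = tokenise(src_text, stop_src)
        tgt_bag = tokenise(tgt_text, stop_tgt)
        for word in src_bag:
            if word in dictionary:
                continue  # Step 2: only out-of-dictionary words are candidates
            ranked = transliterate(word)[:top_k]  # Step 3: ranked candidates
            for rank, candidate in enumerate(ranked, start=1):
                if candidate in tgt_bag:  # Step 4: match in the target document
                    # Step 5 (assumed scoring): higher rank and more
                    # co-occurrences both increase the score.
                    scores[(word, candidate)] += 1.0 / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Passing the transliteration system in as a callback keeps the sketch independent of any particular generation model, such as the n-gram or consonant-vowel approaches mentioned earlier.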
  6. Experimental Setup
     • An English-Persian comparable corpus of news texts was constructed, consisting of 3,474 documents.
     • An English machine-readable dictionary containing 120,177 entries was used.
     • Experiments:
       – Accuracy of the extracted transliterations (fixed training collection). Different matching methods were tried:
         1. English documents are parsed to extract their out-of-dictionary words using dictionary look-up and stemming.
         2. Persian documents are parsed so that characters that are allophones are rendered as one unique character.
         3. The previous experiment is repeated, adding knowledge of capital letters.
       – Impact of the seed transliteration lexicon.
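The two parsing steps in the matching methods above might look like the sketch below. Everything here is an assumption made for illustration: the poster does not list its character classes, so `SAME_SOUND` is a hypothetical table of Persian letters that are commonly pronounced alike, and the suffix stripping in `is_out_of_dictionary` is a naive stand-in for a real stemmer.

```python
# Hypothetical equivalence classes: each letter maps to one representative
# letter with the same pronunciation in Persian.
SAME_SOUND = {
    "ث": "س",
    "ص": "س",
    "ذ": "ز",
    "ض": "ز",
    "ظ": "ز",
    "ط": "ت",
    "ح": "ه",
    "غ": "ق",
}
_TABLE = str.maketrans(SAME_SOUND)


def normalise_persian(word: str) -> str:
    """Collapse same-sounding letters so spelling variants compare equal."""
    return word.translate(_TABLE)


def is_out_of_dictionary(word, dictionary, suffixes=("ing", "ed", "es", "s")):
    """Dictionary look-up with naive suffix stripping (assumed, not a real stemmer)."""
    w = word.lower()
    if w in dictionary:
        return False
    # Treat the word as known if removing one common suffix yields an entry.
    return not any(w.endswith(s) and w[: -len(s)] in dictionary for s in suffixes)
```

With this normalisation, a generated transliteration and a target-document word that differ only in the choice among same-sounding letters would match after both are passed through `normalise_persian`.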
  7. Experiments and Results

     Experiment 1: Accuracy

     Method  #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
     1       4.6     3.6 (81.3)  5.8       68.9      8.5  11.3  1662  2860      70.3
     2       4.1     3.6 (90.2)  5.9       69.7      8.5  11.3  1641  2496      80.4
     3       6.6     5.9 (89.2)  6.9       66.8      8.4  22.6  1725  3694      75.2

     Experiment 2: Train Size

     Train   #Pairs  #Correct    Avg.Rank  Rank 1-5  #P   #E    #doc  Lex-Size  Lex Pr.
     200     2.6     2.2 (86.5)  2.8       84.2      9.1  24.0  1322  1287      78.2
     300     3.1     2.5 (82.5)  4.2       71.9      8.8  23.5  1494  1579      73.3
     400     3.1     2.6 (84.5)  4.6       71.3      8.8  23.4  1483  1569      74.1
     500     2.8     2.5 (89.5)  4.3       78.0      8.9  23.5  1459  1507      79.0
  8. Conclusions and Further Work
     • Transliteration extraction can help generate transliteration lexicons automatically.
     • A transliteration lexicon, that is, a dictionary of transliterations of proper nouns and technical terms that are not translated, is beneficial in dictionary-based machine translation applications.
     • We investigated a method that uses current, yet incomplete, transliteration lexicons to enrich them from comparable corpora.
     • In future, the role of an NE recogniser will be investigated and compared with simple dictionary look-up.