Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
School of Computer Science and IT
Supervisors: Dr. Falk Scholer and Dr. Andrew Turpin
Keywords: Transliteration, Parallel Corpus
Transliteration Extraction So Far!
Methods for discovering transliterations in the literature consider:
• Extraction from parallel corpus:
– Statistical methods are beneficial, particularly because the
sentences/words can be aligned.
– Yet parallel corpora are hard to find for many less-computerised
languages.
• Extraction from comparable corpus:
– More evidence than just statistical information is required to extract
pairs (e.g. temporal or phonetic information, or Web counts).
– Comparable corpora are easier to construct and find than parallel ones.
Most studies use a named entity (NE) recogniser to separate proper nouns that
are subject to transliteration from other words.
Persian and English Transliteration
• Transliteration generation has been studied using n-gram based and
consonant-vowel based approaches.
• Transliteration extraction has not previously been studied for this language
pair, mainly due to the lack of any parallel or comparable corpus.
• Transliteration extraction has been studied using co-occurrence, temporal,
edit distance measures or phonetic similarities. We aim to apply our
transliteration generation methods as a basis for this task.
Proposed Method: Application of Transliteration
Generation in Extraction
1. Each document in each language is pre-processed: it is tokenised
into a bag of words and its stop-words are removed.
2. Each word in the source language is matched against a dictionary;
if it is not found, it is an out-of-dictionary word that needs
transliteration in the target document.
3. A ranked list of possible transliterations for each such source
word is generated by the transliteration system.
4. Transliterations that match a word in the target document
are paired with the source word as potential pairs.
5. A score can be given to these pairs based on the rank of the
transliteration and the number of times they are paired.
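The five steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the system used in this work: the function names, the `top_k` cut-off, the regular-expression tokeniser and the caller-supplied `transliterate` function are all hypothetical. The poster says only that the score depends on the rank and on the number of pairings; summing reciprocal ranks is one possible choice, assumed here.

```python
import re
from collections import defaultdict


def bag_of_words(text, stop_words):
    """Step 1: tokenise a document and drop stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t not in stop_words}


def extract_pairs(doc_pairs, dictionary, transliterate,
                  src_stop_words, tgt_stop_words, top_k=5):
    """Collect scored (source word, transliteration) candidates.

    doc_pairs     -- iterable of (source_text, target_text) comparable documents
    dictionary    -- set of known source-language words
    transliterate -- hypothetical function: word -> ranked list of candidates
    """
    scores = defaultdict(float)
    for src_text, tgt_text in doc_pairs:
        src_words = bag_of_words(src_text, src_stop_words)
        tgt_words = bag_of_words(tgt_text, tgt_stop_words)
        for word in src_words:
            # Step 2: only out-of-dictionary words need transliteration.
            if word in dictionary:
                continue
            # Step 3: ranked candidate transliterations (best first).
            candidates = transliterate(word)[:top_k]
            for rank, candidate in enumerate(candidates, start=1):
                # Step 4: keep candidates that occur in the target document.
                if candidate in tgt_words:
                    # Step 5 (assumed scoring): reciprocal rank, summed
                    # over every document pair in which the pair is seen.
                    scores[(word, candidate)] += 1.0 / rank
    return dict(scores)
```

With this scoring, a pair rises either because the transliteration system ranks the candidate highly or because the same pair recurs across many document pairs; other combinations of the two signals would fit the description equally well.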
• An English-Persian comparable corpus of news texts, consisting of
3,474 documents, was constructed.
• An English machine-readable dictionary was applied which contains
– Accuracy of transliterations extracted (Fixed Training Collection).
Different matching methods were tried (1- English documents are
parsed to extract their out-of-dictionary words using dictionary look-up
and stemming. 2- Persian documents are parsed by
rendering words that contain allophone characters to one unique
character. 3- Repeating the previous experiment including capital
– Impact of seed transliteration lexicon.
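The second matching variant above, rendering Persian allophone characters to one unique character, might be sketched as below. The grouping of letters and the choice of representative for each group are assumptions made for illustration (Persian letters that share a pronunciation, such as the three letters pronounced /s/); the poster does not list the actual mapping used, and `normalise_persian` is a hypothetical name.

```python
# Assumed groups of Persian letters that share a pronunciation.
# Key: representative letter; value: variants rewritten to it.
ALLOPHONE_GROUPS = {
    "\u0633": "\u062b\u0635",        # seen  <- theh, sad       (/s/)
    "\u0632": "\u0630\u0636\u0638",  # zain  <- thal, dad, zah  (/z/)
    "\u062a": "\u0637",              # teh   <- tah             (/t/)
    "\u0647": "\u062d",              # heh   <- hah             (/h/)
    "\u063a": "\u0642",              # ghain <- qaf
}

# Translation table mapping every variant to its representative.
_TABLE = str.maketrans({
    variant: representative
    for representative, variants in ALLOPHONE_GROUPS.items()
    for variant in variants
})


def normalise_persian(word):
    """Rewrite each allophone variant to its group representative."""
    return word.translate(_TABLE)
```

Applying the same normalisation to both the generated transliterations and the words of the Persian document lets spellings that differ only in such letters match each other.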
Conclusions and Further Work
• Transliteration extraction can be helpful in automatically generating
• A transliteration lexicon, as a dictionary of transliterations of proper
nouns or technical terms that are not translated, is beneficial in
dictionary-based machine translation applications.
• We investigated a method that uses current, still incomplete
transliteration lexicons to enrich themselves from comparable corpora.
• In future, the role of an NE recogniser will be investigated and compared
with a simple dictionary look-up.