A tutorial on Machine Translation



National Conference on Translation, Dept. of Hindi, University of Kerala, February 23-24, 2010

Man to Machine
A Tutorial on the Art of Machine Translation
Jaganadh G
jaganadhg@gmail.com
http://jaganadhg.freeflux.net/blog

1 Introduction

Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. It is interesting to think about an MT system that can translate literary works from one language to another. To enjoy the novel Anything for You, Ma'am¹, just feed the novel into an MT system and get it translated into your language. MT systems of this kind are supposed to break the language barrier.

MT can help us overcome the technological barrier too. The drastic developments in Information and Communication Technology (ICT) have led to an overflow of information through the internet. But this information is available only in a very small subset of languages, so it is not reachable for a significant portion of people. This phenomenon is called the "digital divide". A lot of information is available on the internet in English, but the same information may not be available in our vernaculars like Hindi or Malayalam. In India only 3% of the population can understand English². In a country like India, achievements in MT research and development (R&D) therefore have great significance. In short, MT helps the world to be united both intellectually and culturally. To achieve this we have to do a lot of work in language and linguistics as well as in computer science.

The present tutorial is an introduction to the art of MT. The material is compiled with the help of already published literature in the field. The main sources are listed in the reference section. The tutorial is just a theoretical overview of the field.

1.1. History of MT

The history of MT starts in the early 1950s.
However, some speculative ideas existed before that period. In the 17th century two philosophers, Leibniz and Descartes, put forward proposals for codes that could relate words between languages, but these proposals remained theoretical. The first proposal for developing MT was put forward in 1949 by Warren Weaver, a researcher at the Rockefeller Foundation³. A few years later, actual research in MT started at many universities in the United States. The first public demonstration of an MT system was held on 7th January 1954 at the head office of IBM. It is known as the Georgetown-IBM experiment. The system was a kind of toy system, with a vocabulary of just 250 words, translating just 49 carefully selected Russian sentences into English.

Many institutions in the US were very active in MT R&D, and the government was very supportive. In 1964 the US government constituted a committee to evaluate the progress of MT research, the Automatic Language Processing Advisory Committee (ALPAC). The committee concluded that MT was more expensive, less accurate and slower than human translation and that, despite the expense, MT was not likely to reach the quality of a human translator in the near future. It recommended, however, that tools to aid translators, such as automatic dictionaries, should be developed and that research in Computational Linguistics (CL) should continue. The report had a deep impact on MT researchers, and MT research was abandoned for a short period. But the field rose like a phoenix, and there have been significant developments since. MT research is very active in Indian Languages (IL) too.

¹ http://www.raheja.org/
² Sinha, R.M.K. and A. Jain, 2003, AnglaHindi: An English to Hindi machine translation system, Proceedings of MT Summit IX, New Orleans, LA, pp. 23-27.
³ http://en.wikipedia.org/wiki/History_of_machine_translation
2 Machine Translation

Translation can be defined as the act or process of translating, especially from one language into another. Producing high-quality translation is difficult even for human translators. A translator should possess knowledge of the Source Language (SL) and the Target Language (TL), including their grammar and culture. Even a person who possesses all this knowledge cannot be guaranteed to produce a high-quality translation, because natural language is ambiguous. Even the term translation has four different meanings in different contexts. So selecting the right word meaning while translating from SL to TL requires knowledge of the context. Let's see how this can be made possible with computers.

2.1 Approaches in MT

Approaches in MT can be classified into four categories:
  1) Direct MT
  2) Rule-based MT
  3) Corpus-based MT
  4) Knowledge-based MT

  Machine Translation
    Direct MT
    Rule-based MT
      Transfer-based MT
      Interlingua-based MT
    Corpus-based MT
      Statistical MT
      Example-based MT
    Knowledge-based MT
Fig. 1. Machine Translation Approaches

Each of the approaches mentioned above has its own advantages and disadvantages. A brief note on each approach is given below.

2.1.1 Direct Machine Translation

As the name suggests, direct MT systems provide direct translation. No intermediate representation or complex architecture is involved. The approach carries out word-by-word translation with the help of a bilingual dictionary, usually followed by some minimal syntactic rearrangement. It involves little analysis of the SL text and no parsing, and relies mainly on the quality of the bilingual dictionary. The general flow of a direct MT system is:
  1) Remove morphological inflection from the SL words
  2) Look up a bilingual dictionary to get the corresponding TL words
  3) Perform the necessary syntactic rearrangements
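The three steps listed above can be sketched in a few lines of Python. This is a toy illustration only. The lemma table, the tagged bilingual dictionary and the two reordering rules are assumptions made up for this sketch and are not taken from any real system. The dictionary also stores one ready-inflected Hindi form per entry, so that TL morphology can be ignored.

```python
# Toy direct MT pipeline: morphological analysis -> dictionary lookup -> reordering.
# Every table below is a hypothetical, hand-made example.

# Step 1 resource: maps inflected SL word forms to lemmas.
LEMMAS = {"slept": "sleep"}

# Step 2 resource: lemma -> (TL word, coarse tag). N = noun, V = verb, P = adposition.
# Words with no entry (for example "the") are simply dropped.
BILINGUAL_DICT = {
    "sita": ("सीता", "N"),
    "sleep": ("सोयी", "V"),
    "in": ("में", "P"),
    "garden": ("बाग", "N"),
}


def translate(sentence):
    """Translate a simple English sentence into Hindi, word by word."""
    tokens = sentence.rstrip(".").split()

    # Step 1: remove morphological inflection from the SL words.
    lemmas = [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

    # Step 2: look up each lemma in the bilingual dictionary.
    words = [BILINGUAL_DICT[lemma] for lemma in lemmas if lemma in BILINGUAL_DICT]

    # Step 3a: turn "preposition noun" into "noun postposition".
    i = 0
    while i < len(words) - 1:
        if words[i][1] == "P" and words[i + 1][1] == "N":
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2
        else:
            i += 1

    # Step 3b: move verbs to the end of the sentence (Hindi is verb-final).
    non_verbs = [w for w in words if w[1] != "V"]
    verbs = [w for w in words if w[1] == "V"]

    return " ".join(word for word, _tag in non_verbs + verbs) + "।"


print(translate("Sita slept in the garden."))
```

With this toy lexicon the call produces the verb-final order सीता बाग में सोयी।. Any word missing from the dictionary is silently dropped, which shows how heavily the approach depends on dictionary coverage.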
  SL text -> Morphological analysis -> SL words -> Bilingual dictionary lookup (SL-TL dictionary) -> TL words -> Syntactic rearrangement -> TL text
Figure 2. Direct Machine Translation System

Consider the example Sita slept in the garden. Let's see how it is translated into Hindi by a direct MT system.
  Input (English sentence) - Sita slept in the garden.
  Word translation - सीता सोयी में बाग
  Syntactic rearrangement - सीता बाग में सोयी।

Besides simple word translation and reordering, suffix handling and preposition handling are needed to make the translation acceptable. This is called idiomatization. Consider the example:
  English sentence - The boy gave the girl a flower.
  Word translation - लड़का दिया लड़की एक फूल
  Syntactic rearrangement - लड़का लड़की एक फूल दिया
  Idiomatization - लड़के ने लड़की को एक फूल दिया।

Modification of verbs and adjectives to agree in gender and number is also required if the TL has such constraints. In languages like Hindi such grammatical phenomena have to be taken care of to produce a quality translation. E.g.
  English sentence - She saw stars in the sky.
  Word translation - वो देखा तारे में आसमान
  Syntactic rearrangement - वो आसमान में तारे देखा
  Idiomatization - उसने आसमान में तारे देखे।

Attaining such quality with direct MT is very difficult if the SL and TL do not share similar syntactic and morphological phenomena. In general, for a Hindi to English or English to Hindi translation system, word-by-word replacement and idiomatization alone will not produce understandable output. MT output of this kind is called word salad.

The major limitations of this MT approach are:
  1) It does not consider the structure of the sentence or the relationships between words.
  2) There is no attempt to disambiguate word senses. The majority of words in natural language
     are ambiguous. For example, the Hindi word खाना as a verb denotes the activity of eating. When it is preceded by an adjective the meaning changes completely, e.g. बड़ा खाना, where it is a noun meaning a meal.
  3) No adaptability: a system developed for a particular language pair will not be suitable for another language pair.

2.1.2. Rule-based MT (RBMT)

The rule-based approach is considerably more advanced than the direct MT approach. The system relies on hand-made linguistic rules to perform the translation. There are two types of rule-based MT: 1) transfer-based MT and 2) interlingua-based MT.

2.1.2.a. Transfer-based MT

In this approach the SL text is analyzed to produce an intermediate representation, which is then transferred into a representation that matches the rules of the target language. This requires an understanding of the differences between the SL and the TL. A typical flow of transfer-based MT is:
  1) Analysis - analyze the SL text syntactically.
  2) Transfer - transfer the SL syntactic structure to the TL syntactic structure.
  3) Generation - generate the TL text with defined rules.

  SL text -> Analysis (SL grammar) -> SL representation -> Transfer (SL-TL dictionary) -> TL representation -> Synthesis (TL grammar) -> TL text
Figure 3. Diagram of transfer-based MT

We can work through the system with our previous example, Sita slept in the garden.
  Input - Sita slept in the garden
  Analysis output - (S (NP (NNP Sita)) (VP (VBD slept) (PP (IN in) (NP (DT the) (NN garden)))))
  After syntactic transfer - (S (NP (NNP Sita)) (VP (PP (NP (DT the) (NN garden)) (IN in)) (VBD slept)))
  Hindi lexicalization - (S (NP (NNP सीता)) (VP (PP (NP (NN बाग)) (IN में)) (VBD सोयी)))
  Hindi sentence - सीता बाग में सोयी।

The main advantage of the system is its modular structure. The analysis of the SL text is
independent of the TL text generator. Another notable advantage of the system is its ability to resolve ambiguity at the lexical level. For example, the English word book falls into two part-of-speech (POS) categories, noun and verb. This approach can handle such lexical ambiguity to a certain extent. The major disadvantage of the system concerns its adaptability or extensibility to a group of language pairs. If we want to develop systems for English to Hindi and Malayalam to Hindi, we need two SL analyzers.

2.1.2.b Interlingua-based MT

In the interlingua-based approach, the SL text is converted into a language-independent meaning representation called an interlingua. From this interlingual representation the TL text can be generated. In short, translation in this approach is a two-stage process: analysis and synthesis.

  SL text -> Analysis -> Interlingua representation -> TL synthesis -> TL text
Figure 4. Model of interlingua-based MT

The flow of the system is clear from the diagram above. The system receives the input and performs SL analysis. This analysis is SL specific. The effort required to develop an interlingua-based MT system is much greater than for the transfer-based approach. The major source of difficulty is defining a universal and abstract interlingual representation. A sample interlingua representation for the sentence Sita slept in the garden is given below.

  (*sleep
    (tense past)
    (mood declarative)
    (punctuation period)
    (subject (*Sita (number singular)))
    (location (*garden (reference definite)
                       (number singular))))
Sample interlingua for the sentence Sita slept in the garden

2.1.3. Corpus Based MT

A corpus is a large collection of text or speech in a language. In recent years there has been increased interest in corpus-based MT systems.
This is because they need less effort from language and linguistics experts. On the other hand, they require a large sentence-aligned parallel corpus. The corpus-based approach can be classified into two types: 1) statistical MT (SMT) and 2) example-based MT (EBMT).

2.1.3.a. SMT

SMT is inspired by the noisy channel model used in Automatic Speech Recognition (ASR). In the noisy channel model, the channel introduces noise that makes it difficult to recognize the input word. A recognition system based on this model builds a model of the channel to identify how it modifies the input
and recover the original word.

An SMT system models a TL sentence T, given an SL sentence S, as the product of the translation probability P(S|T) and the TL probability P(T). The translation probability P(S|T) accounts for the adequacy of the translated content, whereas P(T) accounts for the fluency of the target construction. The basic view behind SMT is that every sentence in a language has a possible translation in another language, and a sentence in one language can be translated into another language in many ways. The choice among them is specific to the translator.

  S -> Decoder -> T
  The decoder uses the Language Model P(T) and the Translation Model P(S|T).
Figure 4. Noisy channel model for English to Hindi MT

Let's consider an English to Hindi SMT system. Every Hindi sentence h is a possible translation of an English sentence e. The probability that गाय घास खाती है is the translation of Murthy eats apple is low compared to the probability of रवि खाना खाता है being the translation of that sentence. Every pair of sentences (e, h) is assigned a probability P(h|e), which is the probability that a translator, when presented with an English sentence e, will produce h as its Hindi translation. We can assume that when a native speaker of Hindi produces an English sentence, he has a Hindi sentence in mind and translates it into English mentally. The goal of SMT is to find the sentence h that the native speaker had in mind when he produced e. The noisy channel model can be described as

  P(h|e) = P(e,h) / P(e) = P(h) × P(e|h) / P(e)

Since P(e) is fixed for a given input sentence, the best translation is the h that maximizes P(h) × P(e|h).

The two components of SMT are the Language Model (LM) and the Translation Model (TM). A language model gives the probability of a sentence, P(h). These probabilities are calculated with N-gram⁴ techniques. The translation model helps to compute the conditional probability P(e|h); it is trained from a parallel corpus of English/Hindi sentence pairs. This section is just a bird's-eye view of SMT techniques.
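The decoder's job, choosing the Hindi sentence h that maximizes P(h) × P(e|h), can be illustrated with a minimal sketch. All probabilities below are invented numbers chosen for illustration, and the candidate list is fixed by hand. A real SMT system estimates both models from corpora and searches a very large space of candidates instead of enumerating three.

```python
import math

# Toy English input and three hand-picked candidate Hindi translations.
e = "Sita slept in the garden"
fluent_and_adequate = "सीता बाग में सोयी"
adequate_not_fluent = "सीता सोयी में बाग"   # right words, word-salad order
fluent_not_adequate = "सीता घर में सोयी"    # fluent, but says "house"
candidates = [fluent_and_adequate, adequate_not_fluent, fluent_not_adequate]

# Assumed language-model probabilities P(h): the badly ordered sentence scores low.
language_model = {
    fluent_and_adequate: 0.02,
    adequate_not_fluent: 0.0001,
    fluent_not_adequate: 0.02,
}

# Assumed translation-model probabilities P(e|h): the wrong content scores low.
translation_model = {
    (e, fluent_and_adequate): 0.3,
    (e, adequate_not_fluent): 0.3,
    (e, fluent_not_adequate): 0.01,
}


def score(e, h):
    """Return log P(h) + log P(e|h); logs avoid underflow for long sentences."""
    return math.log(language_model[h]) + math.log(translation_model[(e, h)])


def decode(e, candidates):
    """Pick the candidate h that maximizes P(h) * P(e|h)."""
    return max(candidates, key=lambda h: score(e, h))


best = decode(e, candidates)
print(best)
```

The language model penalizes the candidate with poor word order and the translation model penalizes the candidate with the wrong content, so only the sentence that is both fluent and adequate wins.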
Due to time constraints, the section on SMT concludes with these introductory remarks. Some Free and Open Source Software (FOSS) MT tools are now available for experimentation⁵.

⁴ http://en.wikipedia.org/wiki/N-gram
⁵ www.apertium.org, www.statmt.org/moses

2.1.3.b. Example-based MT (EBMT)

An EBMT system uses past translation examples to generate the translation of a given SL text. It maintains an example base consisting of translation examples between the source and target languages. When an SL sentence is given to the system, it retrieves a similar SL sentence and its translation from the example base. Then it adapts the example to generate the TL
sentence for the input sentence. The EBMT approach rests on the idea that similar sentences will have similar translations. The system has two main modules: 1) retrieval and 2) adaptation.

  SL sentence -> Retrieval (example base) -> Example patterns -> Adaptation (adaptation rules / SL-TL dictionary) -> TL sentence
Figure 5. Example-based MT

The task of the retrieval module is to retrieve translation examples from the stored example base. This module tries to retrieve an example that is similar to the input sentence. The adaptation module is responsible for carrying out the necessary modifications to the retrieved example to generate the TL sentence. These modifications may involve the addition, deletion or insertion of morphological words, constituent words or suffixes.

Let's elaborate the concept with an example. Consider English to Hindi translation for the following input sentence:
  Input - Santhosh is writing a letter.
  Example base -
    Vikram wrote a poem. (1)
    Anand is writing. (2)
    Ravi is writing an essay. (3)
    Mukesh writes a Malayalam poem. (4)
  Selection by the retriever -
    Ravi is writing an essay.
    रवि एक निबंध लिख रहा है।

Using this retrieved pair, the system will replace Ravi with Santhosh and निबंध with पत्र in the TL translation, giving संतोष एक पत्र लिख रहा है।

2.1.4 Knowledge-based MT (KBMT)

The MT systems we have seen so far use morphological, syntactic or, to some extent, semantic knowledge to translate SL text into TL. Even though the interlingua-based system uses some semantics, its central concept is syntactic analysis. Semantics-based language analysis was introduced by Artificial Intelligence (AI) researchers. This approach requires a large amount of ontological and lexical knowledge. The KBMT approach includes semantic parsing, lexical decomposition into semantic networks, and resolution of ambiguities and uncertainties by reference to a knowledge base.
  person ::= (person
               (isa creature)
               (agent-of (Eat, Drink, Move, Attack, Love ....))
               (consists-of (Hand Foot, ....)))

  computer-user ::= (computer-user
                      (isa person)
                      (agent-of (+(Operate)))
                      (subworld computer-world))
Example of an ontology for a KBMT system

3. Machine Translation Evaluation

Many online MT systems are available to the general public. One of the most famous online MT services is Google Translate⁶. Have you ever tried the Hindi to English or English to Hindi service of Google Translate? If not, just try it out and have fun!

Evaluating MT can be as demanding as developing an MT system. Why is MT evaluation crucial? Because what a consumer expects from a commercial MT project is high-quality translation. The aim of MT evaluation is to measure how accurately an MT system can handle the phenomena involved in translation from SL to TL. Suppose you give the sentence I like milk as input to an MT system and it produces मैं दूध जैसा हूं instead of मुझे दूध पसंद है. What will your reaction be? You will certainly say that the MT system is a waste! An MT system may translate this sentence into Hindi in the following ways:
  मुझे दूध पसंद है
  दूध मुझे पसंद है
  मैं दूध जैसा हूं
Except for the third translation, all of these are acceptable.

Researchers have developed many MT evaluation techniques. Among them the BLEU⁷, METEOR⁸ and NIST⁹ metrics are widely used. These are automatic MT evaluation methods. Besides these, the most effective method is human evaluation, but its disadvantage is that it is time consuming and costly. The automatic metrics are not equally effective for all language pairs.
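Automatic metrics of this kind compare MT output with one or more reference translations. As a rough illustration of the n-gram overlap idea behind BLEU, the sketch below computes clipped n-gram precision for the I like milk example. It is deliberately not the full BLEU score: there is no brevity penalty and no averaging over several n-gram orders, and the function names are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in a reference.

    Each n-gram is credited at most as many times as it appears in the
    single reference that contains it most often (the "clipping" step).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0

    # Highest count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


references = ["मुझे दूध पसंद है", "दूध मुझे पसंद है"]
good_output = "दूध मुझे पसंद है"
bad_output = "मैं दूध जैसा हूं"

good_score = clipped_precision(good_output, references)
bad_score = clipped_precision(bad_output, references)
print(good_score, bad_score)
```

Here the acceptable output shares every word with a reference, while the unacceptable one shares only दूध, so its unigram precision is one in four. Word overlap alone cannot judge meaning, which is one reason such metrics need validation for each language pair.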
The adaptability of the BLEU metric to English to Indian language translation is under study, and some results and observations are already available¹⁰.

⁶ http://translate.google.com/
⁷ http://en.wikipedia.org/wiki/Bilingual_evaluation_understudy
⁸ http://en.wikipedia.org/wiki/METEOR
⁹ http://en.wikipedia.org/wiki/NIST_(metric)
¹⁰ http://www.cse.iitb.ac.in/~pb/papers/icon07-bleu.pdf

4 MT Research in India

MT research in India started in the 1970s and early 1980s. The major MT system development projects are carried out at IIT Kanpur, Central University of Hyderabad, IIIT Hyderabad, AU-KBC Research Center Chennai, C-DAC, IISc Kolkata and Tamil Virtual University Thanjavur. Among the earliest systems are Anglabharati, for English to Hindi, and the Anusaaraka system, both from IIT Kanpur. A list of MT projects in India is given below.
Name of the MT project   Name of R&D center                                       Language pair
Anglabharati             IIT Kanpur                                               English to Indian languages
Anubharati                                                                        Indian language to English
Anusaaraka               IIT Kanpur, Central Univ. of Hyderabad, IIIT Hyderabad   English to Hindi, IL to IL
MaTra                    C-DAC Mumbai                                             English to Indian languages
Mantra                   C-DAC Pune                                               English to Hindi
UNL based MT             IIT Bombay                                               English to Hindi, Marathi
Tamil Hindi Anusaaraka   AU-KBC Chennai                                           Tamil to Hindi
English Tamil MT                                                                  English to Tamil
Shakti                   IIIT Hyderabad                                           English to Hindi
Sampark                                                                           IL to IL

Beyond these projects, industry giants like IBM and Microsoft are also engaged in English to Hindi MT system development.

5. References

[1] Tanveer Siddiqui and U. S. Tiwary, Natural Language Processing and Information Retrieval, Oxford University Press, Delhi, India, 2008.
[2] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Prentice Hall, 2009.
[3] Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.
[4] Statistical MT tutorial, www.isi.edu/natural-language/mt/wkbk.rtf, accessed on 12-02-2010.
[5] Automatic Translation of Languages, http://www.mt-archive.info/Bar-Hillel-1960.pdf, accessed on 15-02-2010.
[6] An Introduction to Machine Translation, http://www.hutchinsweb.me.uk/IntroMT-TOC.htm, accessed on 01-02-2010.

Note: Some of the examples and diagrams used in this document are adapted from the book Natural Language Processing and Information Retrieval [1], in some cases with modifications.