Your SlideShare is downloading. ×
ICSOFT 2011 - 6th International Conference on Software and Data Technologiescost effectiveness. In this paper we present t...
ICSOFT 2011 - 6th International Conference on Software and Data Technologies SMTM P =                                     ...
Upcoming SlideShare
Loading in...5

Icsoft 2011 51_cr


Published on

Article from the I

Published in: Technology, Business
  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Transcript of "Icsoft 2011 51_cr"

  1. 1. METHOD FOR AN AUTOMATIC GENERATION OF A SEMANTIC-LEVEL CONTEXTUAL TRANSLATIONAL DICTIONARY Dmitry KanDepartment of Technology of Programming, Saint-Petersburg State University, Universitetsky prosp. 35, Peterhof, Russia dmitry.kan@gmail.comKeywords: Translational Dictionary, Semantic Analyzer, Computer Semantics, Machine Translation, Disambiguation.Abstract: In this paper we demonstrate the semantic feature machine translation (MT) system as a combination of two fundamental approaches, where the rule-based side is supported by the functional model of the Russian language and the statistical side utilizes statistical word alignment. The MT system relies on a semantic- level contextual translational dictionary as its key component. We will present the method for an automatic generation of the dictionary where disambiguation is done on a semantic level.1 INTRODUCTION semantic-level contextual translational dictionary in Section 4. Finally, we present the experimental MTThere are two fundamental approaches to Machine system in Section 5. We list the main features of theTranslation (MT): rule-based approach and presented MT system in Section 6.statistical approach, which is based on streams of Classic MT triangle (cf. Fig. 1) separatesinput data. Each of these two approaches has their semantic and syntactic transfers.advantages: the rule-based approach deeply studiesand formalizes the linguistic rules of a naturallanguage; statistical approach gives an opportunityto rapidly prototype new algorithms with applicationto several different natural languages using datastreams. Along with it, there are as welldisadvantages attributed to each of the twofundamental approaches to MT: rule-based approachcommonly lacks automation of languageformalization process and is usually bound to onenatural language (or very few similar languages);statistical approach generally avoids deep view intothe properties of a natural language giving away the Figure 1: MT triangle adopted from Klueva, 2007.task of language formalization to a numericalalgorithm. In this paper we would like to present an The functional theory of the Russian languageongoing project of an MT system, that merges rule- Tuzov, 2004 however shows that these two levelsbased and statistic approaches in its components are interconnected. A morphological surface maywhere it is possible. point to two or more parts of speech. All of them can Since the main component of the system is be equally considered as candidates on the syntactictranslational dictionary, we prepare the grounds of level. However a word’s semantics is required for aits automatic generation in Section 2. The rule-based successful final resolution of a sentence meaningside of the system is supported by the functional (see Section 3 for more detail).model of Russian language in Tuzov, 2004. It is Bennett, 1990 questions the need of a full-blowndescribed in brief in Section 3. We utilize the results semantic analysis for MT and instead suggestsof Sections 2 and 3 for automatic creation of a advancing the „semantic feature system‟ to achieve 415
  2. 2. ICSOFT 2011 - 6th International Conference on Software and Data Technologiescost effectiveness. In this paper we present the MT стремлении ({2}) удержать ({}) властьsystem, which has semantics coded on the level of ({5}) ,dictionary entries and uses semantic analyzer to ({6}) Первез ({7}) Мушарраф ({8}) отверг ({9 10})represent the meaning of an input sentence. конституционную ({14 15}) The related work Homola, 2009 shows how to систему ({})build a translational dictionary between Chech and Пакистана ({11 12 13}) и ({16})English with the statistical word alignment. объявил ({17}) о ({18})However Homola, 2009 does not provide a method введении ({})of resolving the word ambiguities. The main чрезвычайного ({19 21})challenge during generation of a translation положения ({}) . ({22})dictionary is resolving ambiguity of numerous word The above structure represents a mapping of wordssequence pairs. In this work we suggest an approach in Russian sentence into sequences of words in itswhich allowed us to solve the task by semantic English translation. Table 1 contains the mappinginterpretation of each dictionary entry. for the above example. Table 1: Word alignment for English and Russian sentences.2 TASK OF WORD ALIGNMENT Russian English NULL ofTo build an MT system which uses statistic отчаянном Desperate to holdmodelling of a natural language one needs a parallel стремлении tocorpus. Based on the corpus a model of translation is власть powerbuilt and it contains phrase translational dictionary. , ,In the process of constructing the dictionary, phrases Первез Pervezof the parallel corpus get mapped together through Мушарраф Musharrafmaximizing the probability of their co-occurrence in отверг has discardedthe parallel corpus. Maximization is conducted over конституционную constitutional frameworkall possible word sequences of two aligned sentences Пакистана Pakistan ´ sin two languages (cf. Fig. 2). и and объявил declared о a чрезвычайного state emergency . .Figure 2: Example of one word-level alignment of Englishand French sentences adopted from Och, 1999. 3 COMPUTER SEMANTICSThe algorithm of word alignment is described in Al- According to the theory in Tuzov, 2004 any naturalOnaizan et al, 1999 and Och, 1999, while the full language is functional in strict mathematical sense.mathematical formulation of statistical MT is given Each sentence can be represented in a form ofin Brown, Della Pietra, Della Pietra and Mercer, superposition of its functions-words:1993. The toolset Moses by Koehn, 2007 allowsbuilding a statistical MT system and includesGIZA++ as one of its component. GIZA++ S = F ( f1 ( w11 ,..., w1k ),..., f n ( wn1 ,..., wnl )), (1)implements the above algorithms for word wij ≠ whm , ∀i ≠ h, j ≠ malignment and outputs the following structure foreach pair of sentences in the parallel corpus: Definitional domain and values domain of F,f1,..,fnDesperate to hold onto power , Pervez belong to reality. In (1) word inequalities mean, thatMusharraf has we count each word only once. This holds even ifdiscarded Pakistan s constitutional there are several repetitions of a word with the sameframework and or different semantics in the input sentence. In thedeclared a state of emergency . process of a sentence analysis, semantic analyzerNULL ({20}) В ({}) implemented by Tuzov, 2004 operates with the wordотчаянном ({1 3 4}) senses extracted both from the hierarchical ontology416
  3. 3. METHOD FOR AN AUTOMATIC GENERATION OF A SEMANTIC-LEVEL CONTEXTUAL TRANSLATIONAL DICTIONARYwith more than 2000 classes and from the syntactic- English analogue. One important property of thesemantic dictionary. Since a dictionary word is dictionary is that its entries are context dependent.generally represented as n-ary function with This is provided by two circumstances: 1) eachsemantic constraints, its final semantics in the sentence in Russian had its expert translation intosentence depends on the exact words in its context English and 2) each Russian word has beenwithin the sentence. The Tuzov’s syntactic-semantic attributed with a semantic formula that was a resultdictionary contains more than 150, 000 semantic of semantic assembling of the correspondingformulas. Russian sentence. Consider one entry of the above extract in more detail:4 TRANSLATIONAL В Y1>HabU(Y1:,ПРЕД:Z1) DICTIONARY <149>--->Within The semantic formula on the left of ---> sign hasFor creation of the MT system that is based on the several components: the word “В” (the Russianfunctional theory in Tuzov, 2004 we need to preposition with a lot of meanings, roughlytranslate the semantic dictionary onto the target corresponding to the English prepositions in, at,language. For the process automation we have within, into, to, of etc); the basis function HabU(x,y)chosen the method described in the Section 2. For with its arguments (the function defines that xthe parallel corpus we have used list of parallel possesses y), which in the case of HabU are Y1 andsentences in Russian and English from the package Z1 (prepositional case); sign followed by the orderUMC Klyueva and Bojar, 2008. GIZA++ has number of the semantic alternative in the semantic-generated 1,3 million of phrase pairs, including syntactic dictionary.duplicates. Applying the semantic analyzer on each Another example:Russian sentence we have obtained the semantic Y1>Loc(Y1:,ВНУТРИ$12/313/05(ПРЕД:Z1))alternatives of each of the words in the built pairs, <146>--->atwhich correspond to their local context. As a result,each word in the original translational phrase The second argument of Loc(x,y) basis functiondictionary was substituted with its semantic formula. (defines, that x is located in / at y) is itself aThis solves the disambiguation on semantic level. function-preposition that takes one argument inAfter removing the duplicates from the modified prepositional case. The second argument has thedictionary, the final version of semantic-level name “ВНУТРИ” which is appended withcontextual translational dictionary has been built ontological class number $12/313/05. In this classwith about 18, 000 word pairs. The dictionary is number, 12 refers to physical objects, 313 refers tosubject to further clean up procedures and inhabited locality, 05 refers to physical position ofenrichment. Here is an extract from the final an object in prepositional case which is expressed asdictionary: adverb in Russian (“ВНУТРИ” is both “where?”В Y1>HabU(Y1:,ПРЕД:Z1) <149>--->Within and “how?”).В Y1>Loc(Y1:,ВНУТРИ$12/313/05(ПРЕД:Z1)) <146>--->atВ Y1>Loc(Y1:,Oper01(#,ПРЕД:Z1)) <208>--->In 5 EXPERIMENTAL MT SYSTEMВ Y1>Loc(Y1:,ПРЕД:Z1) <224>--->Throughout...НА Y1>Direkt(Y1:,ВЕРХ$12/141/05(ВИН:Z1)) The semantic-level translational dictionary obtained<67>--->at in the Section 4 forms the ground for theНА Y1>Direkt(Y1:,РОД:Z1) <100>--->on experimental Russian to English MT system. InНА Y1>Direkt(Y1:,РОД:Z1) <69>--->for order to achieve fluency on the target language side...ОБРАЗ (РОД:Z1) <2>--->a way and to reduce the noise in the automaticallyОБЩЕМИРОВОЙ generated semantic-level translation dictionary, weA1>Rel(A1:НЕЧТО$1,ПОЛНЫЙ$12/207/05(МИР$1227) have devised the Semantic Machine Translation) <1>--->global Model (SMTM) for translating sentence P onto the... target language L2: Each dictionary entry contains semantic formulacorresponding to the original Russian word and its 417
  4. 4. ICSOFT 2011 - 6th International Conference on Software and Data Technologies SMTM P = applying the method described in Section 4. , (2) arg max Ω (t ,..., t ) = arg max ∑ δ i (tk , tl ) S s i 1 m We have presented the MT system that can be i =1,n k =1,m −1 l =2 ,m i categorized as a semantic feature system according to Bennett, 1990, which has complex semanticwhere: analysis system under the hood. 1, t k tl ∈ L Mδ i (t , t ) = { S 2 (3) k l 0, t k tl ∉ L M REFERENCES 2Index S in (2), (3) suggests that the definitional Al-Onaizan Y., Curin J., Jahr M., Knight K., Lafferty J., S S Melamed D., Och F. J., Purdy J., Smith N. A.,domain of functions δ and Ω is the set of Yarowsky D. 1999. Statistical Machine Translation. i i Final report. Statistical Machine Translation, JHUsemantic formulas coded with symbols tl ∈ L2 in Workshop.the translational dictionary. L2M in (2) is the Bennett W. S. 1990. How Much Semantics is Necessarystatistical model of L2. The translation algorithm for MT systems? Proceedings of the third Internationalstarts with translation of an input Russian sentence Conference on Theoretical and Methodological Issuessemantic representation. It then extracts the in Machine Translation of Natural Languages, Linguistic Research Center, University of Texas,calculated semantic alternatives for each of the Austin, TX:261-270.words and maps them onto their English Brown P. F., Della Pietra V. J., Della Pietra S. A., Mercertranslations. Form the final sentence in English R.L. 1993. The Mathematics of Statistical Machineusing the model (2). Some of the translations show Translation: Parameter Estimation. Computationalthe advantage of semantic processing over statistical linguistics, 19(2):263-311.modelling Moses by Koehn, 2007 in that the system Klueva N., Bojar O. 2008. UMC 0.1: Chech-Russian-picks the correct semantic alternatives of polysemic English Multilingual Corpus. Proceedings of thewords and their correct translations. The conference “Corpora 2008”, Saint-Petersburg.experimental MT system also translates more words Klueva N. 2007. Semantics in Machine Translation. WDS’07 Proceedings of Contributed Papers, Partthan Moses from Russian to English, because it I:141-144.operates with the word lemmas while Moses is Koehn P. et al. 2007. Moses: Open Source Toolkit forsensible to word surfaces. Statistical Machine Translation. Annual meeting of the Association for Computational Linguistics, demonstration session, Prague, Chech Re-public.6 FEATURES OF THE Och F. J. 1999. An Efficient Method for Determining Bilingual Word Classes. Ninth conf. of the Europ. EXPERIMENTAL MT SYSTEM Chapter of the Association for Computational Linguistics, EACL’99:71-76.The presented MT system is in its early stage of Tuzov V., 2004. Computer Semantics of the Russianincorporating the semantic component into the Language, Saint-Petersburg State Universityprocess of MT. We have utilized statistical methods Publishing House. Saint-Petersburg.for automating the construction of a semantic-leveltranslational dictionary and functional theory ofnatural language to resolve the ambiguity andintroduce semantic context. Here is the list of themain features of the MT system: • Dictionary entries contain semantic attributes of the Russian words; • Each entry represents a sample of a context extracted using statistical word alignment and coded with the corresponding semantic formula; • The MT system is automatically extendable through acquiring new parallel corpora and418