Machine Translation Using Multiplexed PDT for Chatting Slang



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), pp. 125-134, © IAEME. Impact Factor (2013): 6.1302 (calculated by GISI).

MACHINE TRANSLATION USING MULTIPLEXED PDT FOR CHATTING SLANG

Rina Damdoo
Department of Computer Science and Engineering
Ramdeobaba C.O.E.M., Nagpur, MS, INDIA

ABSTRACT

This article extends my earlier work, a pioneering step in designing a Bi-Gram based decoder for SMS Lingo. SMS Lingo, also called chatting slang, is the language the younger generation uses for instant messaging and chatting on social networking websites. Such terms often originate with the purpose of saving keystrokes. Over the last few decades, significant increases in both the computational power and the storage capacity of computers have made Statistical Machine Translation (SMT) a concrete and realistic tool, but it still demands large storage capacity. My past work employs a Bi-Gram Back-off Language Model (LM) with an SMT decoder, through which a sentence written with short forms in an SMS is translated into a long-form sentence using non-multiplexed Probability Distribution Tables (PDTs). In this article, the same translation is proposed using a multiplexed PDT (a single PDT) for the Uni-Gram and Bi-Gram models, giving smaller memory requirements. Using an N-Gram LM for chatting slang with a multiplexed PDT is the objective of this work. As this application is meant for small devices such as mobile phones, the approach can be shown to be a memory saver.

Keywords: Statistical Machine Translation (SMT), Bi-Gram, multiplexed Probability Distribution Table (PDT), parallel aligned corpus, Bi-Gram matrix.

I. INTRODUCTION

While typing an SMS, one tries to fit the maximum information into a single message.
This practice has evolved into a new language, SMS Lingo. Internet users have popularized Internet slang (also called chatting slang, netspeak, or chatspeak), a type of slang that many people use when texting on social networking websites to speed up communication. Very few people nowadays write "you" in full rather than "u". Such terms often originate with the purpose of saving keystrokes.

Secondly, the young generation does not pay attention to grammar: instead of writing "I am waiting", they write "am waiting", "I waiting", or "me waiting".

Thirdly, a consequence of using this casual language is that a word-based translation model fails when a person uses the same abbreviation for more than one word. From the data corpus collected, it is observed that one writes the same abbreviation "wh" sometimes for "what", sometimes for "where", sometimes for "why", and sometimes for "who", so to make the context clearer the earlier and/or later words must also be considered. In short, a context-analysis evaluation should be made to choose the right definition [1, 2, 3, 6]. Table I gives some sample abbreviations with their expanded definitions.

Figure 1 shows chatting slang in an example session between two persons on a social networking website. Both user A and user B type short text, but each end user sees the long-form text, which increases readability. Text normalization [1], patent and reference searches, various information retrieval systems, and kids' self-learning can be the main applications of this kind.

TABLE I. SAMPLE ABBREVIATIONS WITH THEIR MULTIPLE EXPANDED DEFINITIONS

Abbreviation   Expanded Definitions
lt             Let, Late
the            The, There, Their
n              In, And
me             Me, May
wer            Were, Wear
dr             Dear, Deer, Doctor

Figure 1. Example session of two persons on the internet.
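The context-analysis idea above can be illustrated with a small sketch: pick the expansion of an ambiguous abbreviation by scoring each candidate against the word that follows it. The candidate lists and bigram scores below are hypothetical, chosen only to mirror the "wh" example; they are not taken from the paper's corpus.

```python
# Candidate expansions for an ambiguous abbreviation (illustrative).
CANDIDATES = {"wh": ["what", "where", "why", "who"]}

# Hypothetical bigram scores P(next_word | candidate), as a corpus
# might provide them; these numbers are assumptions for the sketch.
BIGRAM_SCORE = {
    ("where", "are"): 0.5,
    ("what", "are"): 0.4,
    ("why", "are"): 0.05,
    ("who", "are"): 0.05,
}

def expand(abbrev, next_word):
    """Pick the expansion whose bigram with the following word scores highest;
    unknown abbreviations are passed through unchanged."""
    candidates = CANDIDATES.get(abbrev, [abbrev])
    return max(candidates, key=lambda c: BIGRAM_SCORE.get((c, next_word), 0.0))
```

For example, `expand("wh", "are")` prefers "where", because the following word "are" makes that reading most likely under the assumed scores.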
Our earlier work [4, 5] employs a Bi-Gram LM with a Back-off SMT decoder for template messaging, through which a sentence written with short forms (S) in an SMS is translated into a long-form sentence (L) using non-multiplexed Probability Distribution Tables (PDTs): a Bi-Gram PDT and a Uni-Gram PDT. The software performs the following steps:

• Data corpus collection
• Preprocessing the corpus
• Training the LM:
  o Generating Uni-Gram and Bi-Gram PDTs
• Testing the LM:
  o Using Uni-Gram and Bi-Gram PDTs
  o Using the Back-off decoder to expand a short SMS to a long SMS
• Evaluating the LM with performance and correctness measures:
  o Precision, Recall, and F-factor

While working on this project, it was found that PDT design and generation is the most important phase, because this application is meant for small devices like mobile phones, where memory usage is of most concern. In this article, work using a multiplexed PDT (a single PDT) for the Uni-Gram and Bi-Gram models is presented.

This article is organized as follows. Section 2 describes the N-Gram based SMT system. Section 3 presents the generation of separate Uni-Gram and Bi-Gram PDTs for a Language Model. In Section 4, the generation of a multiplexed PDT for Uni-Grams and Bi-Grams is proposed. Section 5 briefs the experimental setup. Section 6 outlines the experimental results, finally followed by conclusions for this approach.

II. N-GRAM BASED SMT SYSTEM

Among the different machine translation approaches, the statistical N-gram-based system [2, 7, 8, 12, 18, 19] has proved to be comparable with state-of-the-art phrase-based systems (like the Moses toolkit [17]). The SMT probabilities at the sentence level are approximated from word-based translation models that are trained using bilingual corpora [14], and in an N-Gram LM, N-1 words are used to predict the Nth word.
In a Bi-Gram LM (N=2), only the previous word is used to predict the current word. SMT has two major components [3, 9, 14, 15]:

• A probability distribution table
• A Language Model decoder

A PDT captures all the possible translations of each source phrase; these translations are also phrases. Phrase tables are created heuristically using the word-based models [16]. The probability of a target phrase, if f is the source and e is the target language, is given as:

P(L | S) = P(e | f) = count(e, f) / count(f)

For example:

P(too | 2) = count(too, 2) / count(2) = 2 / 10 = 0.20
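The count-based estimate P(e | f) = count(e, f) / count(f) can be sketched directly over a word-aligned list of (source, target) pairs. This is a minimal illustration, not the paper's implementation; the toy alignment below simply reproduces the "2"/"too" numbers of the worked example.

```python
from collections import Counter

def phrase_prob(aligned_pairs, f, e):
    """P(e | f): fraction of occurrences of source word f aligned to target e."""
    pair_counts = Counter(aligned_pairs)                  # count(e, f)
    f_counts = Counter(src for src, _ in aligned_pairs)   # count(f)
    return pair_counts[(f, e)] / f_counts[f]

# '2' aligned to 'too' twice out of ten occurrences, as in the example.
pairs = [("2", "too")] * 2 + [("2", "to")] * 8
```

With this alignment, `phrase_prob(pairs, "2", "too")` yields 0.20, matching the hand calculation above.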
In the example above, the Uni-Gram '2' is present ten times in the collected corpus, and only twice does it represent 'too'. In an N-gram model, the probability of a word given all the preceding words is approximated by the conditional probability on the previous N-1 words, P(wn | w1..wn-1). The Bi-Gram model approximates the probability of a word given only the previous word, P(wn | wn-1):

P(wn | wn-1) = count(wn-1 wn) / Σw count(wn-1 w)

We can simplify this equation, since the sum of all Bi-Gram counts that start with a given word wn-1 must equal the Uni-Gram count for that word:

P(wn | wn-1) = count(wn-1 wn) / count(wn-1)

TABLE II. SOURCE AND TEST DATA

Short Form language (S) / Source Language (f):
  I wnt 2 mt u. W are u? W r u dng?
  I wan 2 mt u. Whe r u? Wht r u doing?
  I want to meet u. Wh are u? What r u doing?
  I wan 2 met u. Where are u? Wht r u dong?
  I want 2 meet u. W r u? What are you dng?
  I wnt to mt you. W r u? W are you dong?
  I wan 2 mt you. Whe are u? Wht are you doing?
  I wan to meet you. Wh r you? What r you doing?
  I wan to met you. Where r u? Wht are you dong?
  I want 2 met you. W r u? What are you dong?

Long Form language (L) / Target Language (e):
  I want to meet you. Where are you? What are you doing?
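The simplified Bi-Gram estimate above, count(wn-1 wn) / count(wn-1), can be sketched for a tokenized corpus in a few lines. This is a minimal sketch, not the paper's software; the two-sentence toy corpus stands in for data like Table II.

```python
from collections import Counter

def bigram_probs(sentences):
    """MLE Bi-Gram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    # The Uni-Gram count of w1 serves as the denominator, per the
    # simplified equation above.
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

# Toy corpus in the spirit of Table II (not the paper's full data).
probs = bigram_probs(["r u dng", "r u doing"])
```

Here 'r u' occurs in both sentences while 'u dng' occurs in one of two, so P(u | r) = 1.0 and P(dng | u) = 0.5 under this toy corpus.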
The model is unsmoothed (there are no unknown words). The maximum likelihood estimate of a Uni-Gram probability can be computed by dividing the count of the word by the total number of word tokens N [8, 15]:

P(w) = count(w) / Σi count(wi) = count(w) / N

As probabilities are all less than 1, the product of many probabilities (probability chain rule) gets smaller the more probabilities one multiplies. This causes a practical problem of numerical underflow [13, 15]. In this case it is customary to do the computation in log space, taking the log of each probability (the logprob). The PDT, however, still contains the raw probabilities or word counts.

III. PHRASE TABLE OR PROBABILITY DISTRIBUTION TABLE WITHOUT MULTIPLEXING

Due to the lack of a sufficient amount of training corpus, the word probability distribution [11, 15] is misrepresented. In a back-off model, if a word pair is not found within the definite context in the training corpus, the higher N-gram tagger backs off to the lower N-gram tagger. The result is the separation of a Bi-Gram into two Uni-Grams.

A. Uni-Gram PDT:

Table III shows the Uni-Gram PDT for the corpus in Table II. For this corpus N = 120. From this PDT, it is observed that Uni-Gram 'u' occurs 18 times in the collected corpus in Table II, and hence has the highest probability, 0.15.

TABLE III. UNI-GRAM PDT FOR THE CORPUS IN TABLE II

Uni-Gram (w)   Probability count(w)/N     Uni-Gram (w)   Probability count(w)/N
i              10/120 = 0.083             where          2/120 = 0.016
want           3/120 = 0.025              w              6/120 = 0.05
wnt            2/120 = 0.016              wh             2/120 = 0.016
wan            5/120 = 0.041              whe            2/120 = 0.016
to             6/120 = 0.05               are            9/120 = 0.075
2              4/120 = 0.033              r              11/120 = 0.091
meet           2/120 = 0.016              what           3/120 = 0.025
mt             4/120 = 0.033              wht            5/120 = 0.041
met            4/120 = 0.033              doing          4/120 = 0.033
you            12/120 = 0.1               dng            2/120 = 0.016
u              18/120 = 0.15              dong           4/120 = 0.033
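The log-space computation mentioned above can be sketched as follows; this is a minimal illustration of the underflow problem and its standard fix, not the paper's code.

```python
import math

def sentence_logprob(probs):
    """Sum of log-probabilities; replaces a product that would underflow."""
    return sum(math.log(p) for p in probs)

# A product of 1000 probabilities of 0.1 underflows to 0.0 as a float,
# while the log-space sum stays perfectly representable.
tiny = [0.1] * 1000
direct_product = math.prod(tiny)   # underflows
logprob = sentence_logprob(tiny)   # about -2302.6
```

This is why the decoder works with logprobs even though the PDT itself stores probabilities or counts.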
TABLE IV. BI-GRAM PDT FOR THE CORPUS IN TABLE II

Bi-Gram (w1 w2)   Probability count(w1 w2)/count(w1)
i wnt             2/10 = 0.2
wnt 2             1/2 = 0.5
2 mt              3/4 = 0.75
mt u              2/4 = 0.5
w are             2/6 = 0.33
are u             4/9 = 0.44
w r               4/6 = 0.66
r u               9/11 = 0.81
u dng             1/18 = 0.05
i wan             5/10 = 0.5
wan 2             3/5 = 0.6
whe r             1/2 = 0.5
wht r             2/5 = 0.4
u doing           2/18 = 0.11
i want            3/10 = 0.3
want to           2/3 = 0.66
to meet           1/6 = 0.16
meet u            1/2 = 0.5
wh are            1/2 = 0.5
what r            2/3 = 0.66
2 met             2/4 = 0.5
met u             2/4 = 0.5
where are         1/2 = 0.5
u dong            1/18 = 0.05
2 meet            1/4 = 0.25
what are          2/3 = 0.66
are you           5/9 = 0.55
you dng           1/12 = 0.08
wnt to            1/2 = 0.5
to mt             1/6 = 0.16
mt you            2/4 = 0.5
you dong          3/12 = 0.25
whe are           1/2 = 0.5
wht are           2/5 = 0.4
you doing         2/12 = 0.16
wan to            2/5 = 0.4
to met            2/6 = 0.33
met you           3/4 = 0.75
wh r              1/2 = 0.5
r you             2/11 = 0.18
where r           1/2 = 0.5
want 2            1/3 = 0.33

B. Bi-Gram PDT:

Table IV shows the Bi-Gram PDT for the corpus in Table II. From this PDT, it is observed that Bi-Gram 'r u' occurs 9 times and 'r you' occurs 2 times, which predicts that in this kind of short-form language a person is more likely to write 'r u' than 'r you'; hence 'r u' has the higher probability, 0.81, over 'r you' with probability 0.18.

IV. PROPOSED WORK

One can create a single matrix of Bi-Grams, a multiplexed PDT [15], instead of two separate PDT tables for Uni-Grams and Bi-Grams. Figure 2 shows a multiplexed PDT for the corpus in Table II. Unlike the un-multiplexed PDT, this PDT contains the Bi-Gram counts; the reason is the provision to calculate the probability of a Uni-Gram from the same PDT. This PDT (matrix) is of size (V+1)*(V+1), where V is the total number of word types in the language, the vocabulary size. '<S>' is a special Uni-Gram used in between the sentences (as start of sentence or end of sentence).
This special Uni-Gram plays an important role in finding the context of a sentence. To read a Bi-Gram count, the row corresponding to the first word is looked up, and the count is read in the column corresponding to the second word of the Bi-Gram. From the PDT, the probability of Bi-Gram 'r u' is calculated as follows:

• The Uni-Gram count of 'r' is found by adding all the entries of the 'r' row, which is 11.
Figure 2. Multiplexed PDT for the corpus in Table II

P(r) = count(r) / N = 11 / 120 = 0.091

• In the row of Uni-Gram 'r' and the column of Uni-Gram 'u', the count is 9.

P(u | r) = count(r u) / count(r) = 9 / 11 = 0.81

The majority of the values in this matrix are zero (a sparse matrix), as the corpus considered is limited. As the size of the corpus grows, one gets more combinations of word tokens as Bi-Grams (out of the scope of this article).

V. EXPERIMENTAL SETUP

The project is divided into two phases:

• Multiplexed PDT generation
• Implementation of the Back-off decoder using the multiplexed PDT
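The row and column look-ups described in Section IV can be sketched as follows, assuming the matrix is built directly from a tokenized corpus with '<S>' separating sentences. This is a simplified stand-in for the paper's PDT (no link fields, no do-not-care entries), built over a two-sentence toy corpus.

```python
from collections import defaultdict

def build_multiplexed_pdt(sentences):
    """Single (V+1) x (V+1) table of Bi-Gram counts, with the special
    '<S>' token marking sentence start and end."""
    matrix = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<S>"] + sent.lower().split() + ["<S>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            matrix[w1][w2] += 1
    return matrix

def unigram_count(matrix, w):
    """Uni-Gram count recovered as the sum of the word's row."""
    return sum(matrix[w].values())

def bigram_prob(matrix, w1, w2):
    """P(w2 | w1) read directly from the multiplexed table."""
    return matrix[w1][w2] / unigram_count(matrix, w1)

pdt = build_multiplexed_pdt(["r u dng", "r u doing"])
```

The single structure serves both models: `unigram_count(pdt, "r")` recovers count(r) as a row sum, and `bigram_prob(pdt, "r", "u")` reads count(r u) from the same row, which is the memory saving the multiplexed PDT is after.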
In the development and testing process, data for the first phase of the project is collected from 10 persons, 1500 words each. Table II shows a piece of the data collected. This data is used to train the LM to obtain a multiplexed PDT. Before the word-aligned parallel corpus is provided to the first phase, it is preprocessed by removing extra punctuation marks and extra spaces and by marking the beginning and end of each statement with '<S>'. This is done using regular expression metacharacters in JAVA and is useful in context checking.

Figure 3. Multiplexed PDT for the corpus in Table II, containing additional information about Bi-Grams

Figure 3 shows the multiplexed PDT with some additional information about Bi-Grams required in the software [4, 5]: along with the probability information, we need to know the long form for each Bi-Gram. This information is kept in the same matrix with a link field, a common link for all the source short-form Bi-Grams having the same target long-form translation. Some do-not-care (X) entries are also used in this table to save the time of the back-off decoder: while looking for a Bi-Gram, as soon as the decoder finds X, it copies the input phrase to the output string without any further calculation of probability. Otherwise, if the decoder fails to find a non-zero entry for the Bi-Gram, it breaks the Bi-Gram into two Uni-Grams, which are then handled separately by the decoder. There is an additional link field for the target Uni-Gram long-form translation. If the decoder is unable to find the Uni-Gram in the PDT, it copies the input word as-is to the output string.
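The decoder logic described above (try the Bi-Gram first, back off to Uni-Grams, copy unknown words through unchanged) can be sketched as follows. The translation tables here are illustrative assumptions standing in for the link fields of the multiplexed PDT, not the paper's actual data.

```python
# Hypothetical long-form translation tables standing in for the link
# fields of the multiplexed PDT (illustrative, not the paper's data).
BIGRAM_LONG = {("r", "u"): "are you", ("2", "mt"): "to meet"}
UNIGRAM_LONG = {"u": "you", "r": "are", "wnt": "want", "2": "to"}

def backoff_decode(tokens):
    """Expand short-form tokens: Bi-Gram match first, then Uni-Gram
    back-off, then copy the token through unchanged."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAM_LONG:
            out.append(BIGRAM_LONG[(tokens[i], tokens[i + 1])])
            i += 2  # consumed a whole Bi-Gram
        else:
            out.append(UNIGRAM_LONG.get(tokens[i], tokens[i]))
            i += 1  # back off to a single Uni-Gram
    return " ".join(out)
```

For instance, `backoff_decode("i wnt 2 mt u".split())` expands the Bi-Gram '2 mt' as a unit, expands 'wnt' and 'u' via the Uni-Gram table, and passes the unseen token 'i' through, giving "i want to meet you".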
VI. EXPERIMENTAL RESULTS

The software produces correct translations for seen words, and unseen words are output without any alteration. For some Bi-Grams, like 'w r', the result depends on the indexing of the PDT: for example, 'w r' always produced 'where are', as the word token 'w' first appeared in the corpus for the long word 'where'. This limitation can be overcome by making more than one entry for the word token 'w': one when it appears in place of 'where' and another when it appears in place of 'what'. Word combinations like 'lol' for 'lots of love' cannot be expanded, as the work is limited to a word-aligned parallel corpus. Finally, from an implementation point of view, the creation and handling of a multiplexed PDT is more complex than that of separate PDTs in a machine translation application.

CONCLUSION

This work focuses on a multiplexed-PDT, Bi-Gram based statistical LM trained in the chatting slang language domain. SMT systems store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. As this application is meant for small devices such as mobile phones, the approach can be shown to be a memory saver.

In the future, work can be done on performance improvement by increasing the size of the corpus and of the language model using the multiplexed PDT. Patent and reference searches, various information retrieval systems, and communication on social networking websites are the main applications of this work.

REFERENCES

[1] Deana Pennell, Yang Liu, "Toward text message normalization: modeling abbreviation generation", ICASSP 2011, IEEE, 2011.
[2] Carlos A. Henríquez Q., Adolfo Hernández H., "A N-gram based statistical machine translation approach for text normalization on chatspeak style communication", CAW2.0 2009, April 21, 2009, Madrid, Spain.
[3] Waqas Anvar, Xuan Wang, Lu Li, Xiao-Long Wang, "A statistical based part of speech tagger for Urdu language", IEEE, 2007.
[4] Rina Damdoo, Urmila Shrawankar, "Probabilistic Language Model for Template Messaging based on Bi-Gram", ICAESM-2012, IEEE, 2012.
[5] Rina Damdoo, Urmila Shrawankar, "Probabilistic N-Gram Language Model for SMS Lingo", RACSS-2012, IEEE, 2012.
[6] Srinivas Bangalore, Vanessa Murdock, and Giuseppe Riccardi, "Bootstrapping bilingual data using consensus translation for a multilingual instant messaging system", in 19th International Conference on Computational Linguistics, Taipei, Taiwan, 2002, pp. 1-7.
[7] Yong Zhao, Xiaodong He, "Using n-gram based features for machine translation", Proceedings of NAACL HLT 2009: Short Papers, pp. 205-208, Boulder, Colorado, June 2009.
[8] Marcello Federico, Mauro Cettolo, "Efficient handling of n-gram language models for statistical machine translation", Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88-95, Prague, June 2007.
[9] Josep M. Crego, José B. Mariño, "Extending MARIE: an N-gram-based SMT decoder", Proceedings of the ACL 2007 Demo and Poster Sessions, pp. 213-216, Prague, June 2007.
[10] Zhenyu Lv, Wenju Liu, Zhanlei Yang, "A novel interpolated n-gram language model based on class hierarchy", IEEE, 2009.
[11] Najeeb Abdulmutalib, Norbert Fuhr, "Language models and smoothing methods for collections with large variation in document length", IEEE, 2008.
[12] Aarthi Reddy, Richard C. Rose, "Integration of statistical models for dictation of document translations in a machine-aided human translation task", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, November 2010.
[13] Evgeny Matusov, "System combination for machine translation of spoken and written language", IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 7, September 2008.
[14] Keisuke Iwami, Yasuhisa Fujii, Kazumasa Yamamoto, Seiichi Nakagawa, "Out-of-vocabulary term detection by n-gram array with distance from continuous syllable recognition results", IEEE, 2010.
[15] Daniel Jurafsky and James H. Martin, "Speech and Language Processing", Pearson, 2011.
[16] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer, "The mathematics of statistical machine translation: parameter estimation", Computational Linguistics, 19(2):263-311, 1993.
[17] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: open source toolkit for statistical machine translation", in Proceedings of the ACL, 2007.
[18] J. B. Mariño, R. E. Banchs, J. M. Crego, A. de Gispert, P. Lambert, J. A. Fonollosa, and M. R. Costa-jussà, "N-gram based machine translation", Computational Linguistics, 32(4):527-549, 2006.
[19] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Trans. Acoust., Speech and Signal Proc., ASSP-35(3):400-401, 1987.
[20] Mousmi Chaurasia and Dr. Sushil Kumar, "Natural Language Processing Based Information Retrieval for the Purpose of Author Identification", International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010, pp. 45-54, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
[21] P. Mahalakshmi and M. R. Reddy, "Speech Processing Strategies for Cochlear Prostheses - The Past, Present and Future: A Tutorial Review", International Journal of Advanced Research in Engineering & Technology (IJARET), Volume 3, Issue 2, 2012, pp. 197-206, ISSN Print: 0976-6480, ISSN Online: 0976-6499.