Current state of the art pos tagging for indian languages – a study



International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 1, Number 1, May - June (2010), pp. 250-260, © IAEME

CURRENT STATE OF THE ART POS TAGGING FOR INDIAN LANGUAGES – A STUDY

Shambhavi. B. R, Department of CSE, R V College of Engineering, Bangalore
Dr. Ramakanth Kumar P, Department of ISE, R V College of Engineering, Bangalore

Parts-of-speech (POS) tagging is the basic building block of any Natural Language Processing (NLP) tool. A POS tagger has many applications. For Indian languages in particular, POS tagging takes on additional dimensions, as most of them are agglutinative, morphologically very rich, highly inflected and sometimes diglossic. Taggers have been developed using linguistic rules, stochastic models or both. This paper is a survey of the POS taggers developed in the recent past for eight Indian languages, namely Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri and Assamese.

Keywords- Parts-of-speech tagger, Indian languages, agglutinative

I. INTRODUCTION

India is a large multi-lingual country of diverse culture. It has many languages with written forms and over a thousand spoken languages. The Constitution of India recognizes 22 languages, spoken in different parts of the country. The languages can be categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These classes of languages have some important differences: their ways of forming words and their grammars differ. But both include many Sanskrit words. In addition, both have a similar construction and phraseology that links them closely together.
There is a need to develop information processing tools that facilitate human-machine interaction in Indian languages, as well as multi-lingual knowledge resources. A POS tagger forms an integral part of any such processing tool. POS tagging involves selecting the most likely sequence of syntactic categories for the words in a sentence. The tagger facilitates the process of creating an annotated corpus. Annotated corpora find their major use in NLP applications such as Text to Speech Conversion, Speech Recognition, Word Sense Disambiguation, Machine Translation and Information Retrieval.

II. TECHNIQUES FOR POS TAGGING

There exist different approaches to POS tagging. The tagging models can be classified into unsupervised and supervised techniques. The two differ in the degree of automation of the training and tagging processes. Unsupervised POS tagging models do not require a previously annotated corpus. Instead, they use advanced computational techniques to automatically induce tagsets, transformation rules, etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule based or transformation based systems. Supervised POS tagging models require a pre-annotated corpus, which is used in training to learn information about the tagset, word-tag frequencies, tag sequence probabilities and/or rule sets. Various taggers exist based on these models. Both supervised and unsupervised taggers can be further classified into the following types.
[Figure 1. Various techniques for POS tagging. The figure divides POS tagging into unsupervised and supervised approaches, each subdivided into rule based, stochastic and neural taggers. The techniques named in the figure are Baum-Welch, Brill, CRF, maximum likelihood, decision trees, HMM, n-grams, SVM and the Viterbi algorithm.]
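To make the supervised setting concrete, the following minimal sketch shows how the tagset, the word-tag frequencies and the tag sequence (bigram) probabilities mentioned above might be collected from a pre-annotated corpus. The toy corpus, the tag names, the start marker `<s>` and the function names are assumptions made for illustration only; none of them is taken from the surveyed systems.

```python
from collections import Counter, defaultdict

# Hypothetical toy annotated corpus: each sentence is a list of (word, tag) pairs.
CORPUS = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
]


def collect_statistics(corpus):
    """Collect the tagset, word-tag counts and tag-bigram counts."""
    word_tag = defaultdict(Counter)    # word -> how often each tag was seen with it
    tag_bigram = defaultdict(Counter)  # previous tag -> counts of the following tag
    tagset = set()
    for sentence in corpus:
        prev = "<s>"  # assumed sentence-start marker
        for word, tag in sentence:
            word_tag[word][tag] += 1
            tag_bigram[prev][tag] += 1
            tagset.add(tag)
            prev = tag
    return tagset, word_tag, tag_bigram


def transition_probability(tag_bigram, prev, tag):
    """Relative-frequency estimate of P(tag | prev); 0.0 if prev was never seen."""
    total = sum(tag_bigram[prev].values())
    return tag_bigram[prev][tag] / total if total else 0.0


tagset, word_tag, tag_bigram = collect_statistics(CORPUS)
print(sorted(tagset))
print(dict(word_tag["run"]))
print(transition_probability(tag_bigram, "DET", "NOUN"))
```

In this toy corpus the word "run" is seen once as a noun and once as a verb, which is the kind of ambiguity a tagger has to resolve from context.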
A. Rule based Tagger

Rule-based taggers use rules, which can be hand-coded or derived from data, i.e., a tagged corpus. The rules are based on experience and help to resolve tag ambiguity. For example, the Brill tagger is a rule-based tagging system. It includes lexical rules, used for initialisation, and contextual rules, used to correct the tags.

B. Stochastic Tagger

Stochastic taggers use statistics, i.e., frequency or probability, to tag the input text. The simplest stochastic taggers resolve the ambiguity of words based on the probability that a word occurs with a particular tag. The tag encountered most frequently in the training set is the one assigned to an ambiguous instance of that word in the test data. The disadvantage of this approach is that, although it may yield a valid tag for each word, it can also yield invalid sequences of tags. The alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is referred to as the n-gram approach, since the best tag for a given word is determined by the probability that it occurs with the n-1 previous tags. Stochastic taggers are built on models such as the Hidden Markov Model (HMM), Maximum Likelihood Estimation, Decision Trees, n-grams, Maximum Entropy, Support Vector Machines or Conditional Random Fields.

C. Neural Tagger

Neural taggers are based on neural networks, which learn the parameters of the POS tagger from a representative training data set [1]. Their performance has been reported to be better than that of stochastic taggers.

III. CURRENT WORK IN INDIAN LANGUAGES

There has been extensive work towards building POS taggers for languages across the world. Western languages have annotated corpora in abundance, and hence all machine learning techniques have been tried. The accuracy of these taggers ranges from approximately 93% to 98%. Tagging of Indian languages, however, is a very challenging task. The primary reasons are the limited availability of annotated corpora and the morphological richness of Indian languages. This section details the work carried out in this regard in various universities and research centres in India.
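The two stochastic strategies described in Section II-B can be contrasted in a short sketch: a most-frequent-tag baseline, and a bigram HMM decoded with the Viterbi algorithm. The English toy corpus, the tag names, the add-one smoothing and all function names are assumptions chosen for illustration; this is not the implementation of any tagger surveyed here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences of (word, tag) pairs.
TRAIN = [
    [("they", "PRON"), ("can", "AUX"), ("fish", "VERB")],
    [("they", "PRON"), ("can", "AUX"), ("swim", "VERB")],
    [("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
    [("the", "DET"), ("fish", "NOUN"), ("swim", "VERB")],
]


def train(corpus):
    emit = defaultdict(Counter)     # tag -> word counts
    trans = defaultdict(Counter)    # previous tag -> next-tag counts
    by_word = defaultdict(Counter)  # word -> tag counts
    for sentence in corpus:
        prev = "<s>"  # assumed sentence-start marker
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            by_word[word][tag] += 1
            prev = tag
    return emit, trans, by_word


def most_frequent_tag(words, by_word, default="NOUN"):
    """Baseline: give each word the tag it carried most often in training."""
    return [by_word[w].most_common(1)[0][0] if w in by_word else default
            for w in words]


def viterbi(words, emit, trans, by_word):
    """Most probable tag sequence under a bigram HMM with add-one smoothing."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags, n_words = len(tags), len(by_word)

    def log_p(counter, key, size):
        # Add-one smoothing keeps unseen events at a small non-zero probability.
        return math.log((counter[key] + 1) / (sum(counter.values()) + size))

    # Each column maps tag -> (best log score ending in that tag, previous tag).
    best = [{t: (log_p(trans["<s>"], t, n_tags)
                 + log_p(emit[t], words[0], n_words), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0]
                       + log_p(trans[p], t, n_tags))
            score = (best[-1][prev][0] + log_p(trans[prev], t, n_tags)
                     + log_p(emit[t], word, n_words))
            column[t] = (score, prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


emit, trans, by_word = train(TRAIN)
sentence = ["the", "can", "rusts"]
print(most_frequent_tag(sentence, by_word))     # ['DET', 'AUX', 'VERB']
print(viterbi(sentence, emit, trans, by_word))  # ['DET', 'NOUN', 'VERB']
```

The baseline tags "can" as an auxiliary because that is its most frequent tag in the toy data, which produces a determiner-auxiliary sequence never seen in training. The sequence model instead prefers the noun reading after a determiner.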
A. Hindi

In recent years, there has been a lot of work towards building a POS tagger for Hindi, the official language of India. Early work started with the development of a partial POS tagger by Ray [2]. This was followed by the work of Shrivastava et al., who proposed harnessing the morphological characteristics of Hindi for POS tagging [3]. This was further enhanced in [4], which suggests a methodology that makes use of detailed morphological analysis and lexicon lookup for tagging. It used an annotated corpus of around 15,000 words collected from the BBC news site and a decision tree based learning algorithm, CN2. The accuracy was 93.45% with a tagset of 23 POS tags.

The International Institute of Information Technology (IIIT), Hyderabad, initiated a POS tagging and chunking contest, NLPAI ML, for Indian languages in 2006. Several teams came up with various approaches for tagging in three Indian languages, namely Hindi, Bengali and Telugu. In this contest, CRFs were first applied to Hindi by Ravindran et al. [5] and Himanshu et al. [6] for POS tagging and chunking, where they reported a performance of 89.69% and 90.89% respectively. In the work of Sankaran Baskaran [7], an HMM based statistical technique was attempted. Here, probability models of certain contextual features were also used. POS tagging of Hindi based on the Maximum Entropy Markov Model was developed by Aniket Dalal et al. [8]. In this system, the main POS tagging features used were context based features, dictionary features, word features and corpus-based features.

In 2007, as part of the SPSAL workshop in IJCAI-07, IIIT, Hyderabad conducted a competition on POS tagging and chunking for the South Asian languages Hindi, Bengali and Telugu. None of the teams tried the rule based approach. All eight participants tried a wide range of learning techniques such as HMM, Decision Trees, CRF, Naïve Bayes and the Maximum Entropy Model. The average POS tagging accuracies of all the systems for Hindi, Bengali and Telugu are 73.93%, 72.35% and 71.83% respectively.
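As a quick arithmetic check, the averages quoted above can be recomputed from the per-team figures listed in Table I. The sketch below copies the table values; the variable and function names are illustrative assumptions. The recomputed means agree with the reported ones to within about 0.02 percentage points, a difference that may simply be due to rounding in the published figures.

```python
# Per-team accuracies (%) as listed in Table I: (Hindi, Bengali, Telugu).
SPSAL_RESULTS = {
    "Pattabhi et al": (76.34, 72.12, 53.17),
    "Satish and Kishore": (69.35, 60.08, 77.20),
    "Rao and Yarowsky": (73.90, 69.07, 72.38),
    "Himanshu": (62.35, 76.0, 77.16),
    "Asif et al": (76.87, 73.17, 67.69),
    "Sandipan": (75.69, 77.61, 74.47),
    "Ravi et al": (78.35, 74.58, 75.27),
    "Avinesh and Karthik": (78.66, 76.08, 77.37),
}

# Averages as stated in the text.
REPORTED_AVERAGES = {"Hindi": 73.93, "Bengali": 72.35, "Telugu": 71.83}


def column_means(results):
    """Mean accuracy per language over all participating teams."""
    languages = ("Hindi", "Bengali", "Telugu")
    columns = zip(*results.values())
    return {lang: sum(col) / len(col) for lang, col in zip(languages, columns)}


means = column_means(SPSAL_RESULTS)
for lang, value in means.items():
    print(f"{lang}: recomputed {value:.2f}, reported {REPORTED_AVERAGES[lang]:.2f}")
```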
Table I. Summary of the approaches followed and accuracies (%) obtained by various participating teams of SPSAL workshop

Team                  Approach Used    Hindi   Bengali   Telugu
Pattabhi et al        HMM              76.34   72.12     53.17
Satish and Kishore    HMM              69.35   60.08     77.20
Rao and Yarowsky      HMM              73.90   69.07     72.38
Himanshu              CRF              62.35   76        77.16
Asif et al            Hybrid HMM       76.87   73.17     67.69
Sandipan              CRF              75.69   77.61     74.47
Ravi et al            Max. Entropy     78.35   74.58     75.27
Avinesh and Karthik   Decision Trees   78.66   76.08     77.37

Manish Shrivastava and Pushpak Bhattacharyya [9] designed a simple POS tagger for Hindi based on HMM. It utilized the morphological richness of the language without resorting to complex and expensive analysis. It achieved a good accuracy of 93.12%. More recent work in this area is that of Ankur Parikh [10], in which neural networks are tried for tagging. This multi-neuro tagger deals with sparse data, manages multiple contexts, takes less training time and has an accuracy comparable to other traditional tagging approaches for Indian languages.

B. Bengali

Bengali is an eastern Indo-Aryan language. It is ranked the sixth most spoken language of the world. Almost all tagging approaches have been tried on Bengali text. Participants at the NLPAI Contest 2006 and SPSAL 2007 tried tagging for Bengali along with Hindi and Telugu. The highest accuracies obtained for Bengali in the two contests were 84.34% and 77.61% respectively. An HMM based tagger is reported in [11]. A Maximum Entropy based tagger was built in [12]. This tagger demonstrated an accuracy of 88.2% for a test set of 20,000 word forms. CRF and SVM based taggers are reported in [13] and [14] respectively. The SVM tagger used 26 tags and had a performance of 86.84%. Recently, Ekbal et al. applied a voted approach [15] in order to obtain the best results in Bengali tagging.

C. Tamil

Tamil is the Dravidian language for which good and comparatively extensive work has been done in the field of POS tagging. A work by Vasu Ranganathan named tag tamil is based on a lexical phonological approach. The tagger handles the morphotactics of morphological processing of verbs by using an index method. Ganesan's POS tagger [16] works on the CIIL corpus. Its tagset includes 82 tags at the morph level and 22 at the word level. Kathambam is a heuristic rule based tagger designed at RCILTS-Tamil. The performance of the tagger is around 80%. It is based on the bigram model. In [17], a hybrid tagger using rule based and HMM techniques is developed. SVMTool was used to tag the corpus in [18], and an accuracy of 94.12% was obtained. Lakshmana Pandian and Geetha [19] experimented with a morpheme based tagger. A naive Bayes probabilistic model using morphemes is the first stage, for preliminary POS tagging, and a CRF model is the next stage, to disambiguate the conflicts that arise in the first stage. The overall accuracy of the tagger was 95.92%. Dhanalakshmi et al. [20] used an SVM methodology based on linear programming. This gave an accuracy of 95.63% on the test data.

D. Telugu

Telugu is the third most-spoken language in India (with about 74 million native speakers). It is the official language of Andhra Pradesh. In 2006, Sreeganesh [21] implemented a rule based POS tagger. In the initial stage, a Telugu morphological analyzer analyses the input text. A tagset is then applied to its output, and finally around 524 formulated morpho-syntactic rules perform the disambiguation. During the NLPAI Contest 2006, a POS tagger with an accuracy of 81.59% was built. In the SPSAL 2007 workshop of IJCAI-07, the best Telugu tagger was proposed by Avinesh et al. [22], with a performance of 77.37%. In [23], three Telugu taggers, namely (i) a rule-based tagger, (ii) a Brill tagger and (iii) a Maximum Entropy tagger, were developed with accuracies of 98.016%, 92.146% and 87.81% respectively. Recent work has been by Sindhiya Binulal et al. [24], who applied SVMTool to tagging. The tagset included 10 tags, and an accuracy of around 95% was obtained.
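Several of the systems above combine more than one tagger, for example the voted approach reported for Bengali [15]. The survey does not describe how the votes are cast, so the following sketch only illustrates one common scheme, per-token majority voting with ties resolved in favour of the earliest-listed tagger. The scheme, the tag names and the function name are assumptions and are not taken from [15].

```python
from collections import Counter


def majority_vote(predictions):
    """Combine tag sequences from several taggers by per-token majority vote.

    predictions: list of tag sequences, one per tagger, all of equal length.
    Ties are broken in favour of the tagger listed first (an assumed policy).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")
    voted = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        top = max(counts.values())
        # Pick the first tag, in tagger order, that reached the top vote count.
        voted.append(next(t for t in token_tags if counts[t] == top))
    return voted


# Hypothetical outputs of three taggers for one three-token sentence.
hmm_tags = ["NN", "VB", "NN"]
crf_tags = ["NN", "NN", "NN"]
svm_tags = ["JJ", "VB", "NN"]
print(majority_vote([hmm_tags, crf_tags, svm_tags]))  # ['NN', 'VB', 'NN']
```

Placing the tagger with the best individual accuracy first makes the tie-break fall back on that tagger's decision.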
E. Gujarati

Gujarati is a less privileged language with respect to available resources and manually tagged data. As a very first step towards tagging, Chirag Patel and Karthik Gali [25] designed a hybrid model. The linguistic rules specific to Gujarati are converted into features and provided to a CRF, in order to take advantage of both the statistical and the rule based approach. An accuracy of 92% has been achieved by this approach.

F. Malayalam

Malayalam is primarily spoken in southern coastal India, by over 36 million speakers. It is one of the Dravidian languages where much work is still to be done. Manju K et al. [26] experimented with the stochastic approach for tagging Malayalam words. In the first step, a morphological analyzer is used to generate tagged corpora, which are later used by the HMM based tagger. The results obtained were promising. Later work was by Antony P.J et al. [27], who applied the SVM approach to tag words. They identified the ambiguities in Malayalam lexical items and developed a tagset of 29 tags. The result was more accurate than the earlier work. With an increase in the number of words in the training set, the performance increased to around 94%.

G. Manipuri

Manipuri is the official language of Manipur. There are at least 29 different dialects spoken in Manipur. Manipuri tagging depends on morphological analysis and the lexical rules of each category. Hence Thoudam Doren Singh and Sivaji Bandyopadhyay initially tried to build a morphology driven tagger [28]. This showed an accuracy of only 69%. Later they built a tagger [29] using Conditional Random Fields (CRF) and Support Vector Machines (SVM). The tagset consisted of 26 tags. Evaluation results demonstrated an improvement in accuracy: they obtained 72.04% with CRF and 74.38% with SVM.

H. Assamese

Like other Indian languages, Assamese is a morphologically rich, agglutinative language with relatively free word order. Navanath Saharia [30] built an Assamese tagger using the HMM model with the Viterbi algorithm. An accuracy of 87% was achieved by the tagger on the test inputs. Pallav Kumar Dutta has attempted to develop an online semi-automated tagger. This was designed to deal with the sparse data problem of the language. NLTK is used to tag the test data, and for ambiguous tags an online tagger helps the user to change the tags.

IV. CONCLUSION

Development of a high accuracy POS tagger is an active research area in NLP. The bottleneck in POS tagging of Indian languages is the non-availability of lexical resources. In addition, adoption of a common tagset by researchers would facilitate reusability and interoperability of annotated corpora. In this paper we have presented a detailed study of the POS taggers developed for eight Indian languages. There exist other languages of the country, however, for which hardly any attempts towards building a POS tagger have started.

REFERENCES

[1] Ahmed (2002), “Application of multilayer perceptron network for tagging parts-of-speech”, Proceedings of the Language Engineering Conference, IEEE.
[2] A. Basu, P. R. Ray, V. Harish and S. Sarkar (2003), “Part of speech tagging and local word grouping techniques for natural language parsing in Hindi”, Proceedings of the International Conference on Natural Language Processing (ICON 2003).
[3] S. Singh, M. Shrivastava, N. Agrawal and P. Bhattacharyya (2005), “Harnessing morphological analysis in pos tagging task”, Proceedings of the International Conference on Natural Language Processing (ICON 2005).
[4] Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak Bhattacharyya (2006), “Morphological richness offsets resource demand – experiences in constructing a pos tagger for Hindi”, Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, pp. 779–786.
[5] Pranjal Awasthi, Delip Rao, Balaraman Ravindran (2006), “Part Of Speech Tagging and Chunking with HMM and CRF”, Proceedings of the NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[6] Himanshu Agrawal, Anirudh Mani (2006), “Part Of Speech Tagging and Chunking Using Conditional Random Fields”, Proceedings of the NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[7] Sankaran Baskaran (2006), “Hindi POS tagging and Chunking”, Proceedings of the NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[8] Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke (2006), “Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach”, Proceedings of the NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[9] Manish Shrivastava, Pushpak Bhattacharyya (2008), “Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge”, Proceedings of ICON-2008: 6th International Conference on Natural Language Processing.
[10] Ankur Parikh (2009), “Part-Of-Speech Tagging using Neural network”, Proceedings of ICON-2009: 7th International Conference on Natural Language Processing.
[11] A. Ekbal, S. Mondal and S. Bandyopadhyay (2007), “POS Tagging using HMM and Rule-based Chunking”, Proceedings of SPSAL-2007, IJCAI-07, pp. 25-28.
[12] A. Ekbal, R. Haque and S. Bandyopadhyay (2008), “Maximum Entropy Based Bengali Part of Speech Tagging”, Advances in Natural Language Processing and Applications, Research in Computing Science (RCS) Journal, Vol. (33), pp. 67-78.
[13] A. Ekbal, R. Haque and S. Bandyopadhyay (2007), “Bengali Part of Speech Tagging using Conditional Random Field”, Proceedings of the 7th International Symposium on Natural Language Processing (SNLP-07), Thailand, pp. 131-136.
[14] A. Ekbal and S. Bandyopadhyay (2008), “Part of Speech Tagging in Bengali using Support Vector Machine”, Proceedings of the International Conference on Information Technology (ICIT 2008), pp. 106-111, IEEE.
[15] A. Ekbal, M. Hasanuzzaman and S. Bandyopadhyay (2009), “Voted Approach for Part of Speech Tagging in Bengali”, Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation (PACLIC-09), December 3-5, Hong Kong, pp. 120-129.
[16] Ganesan M (2007), “Morph and POS Tagger for Tamil” (Software), Annamalai University, Annamalai Nagar.
[17] Arulmozhi P, Sobha L (2006), “A Hybrid POS Tagger for a Relatively Free Word Order Language”, Proceedings of MSPIL-2006, Indian Institute of Technology, Bombay.
[18] Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S (2008), “Tamil Part-of-Speech tagger based on SVMTool”, Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP), Chiang Mai, Thailand.
[19] S. Lakshmana Pandian and T. V. Geetha (2008), “Morpheme based Language Model for Tamil Part-of-Speech Tagging”, Research journal on Computer science and computer engineering with applications, July-Dec 2008, pp. 19-25.
[20] Dhanalakshmi V, Anandkumar M, Shivapratap G, Soman K.P, Rajendran S (2009), “Tamil POS Tagging using Linear Programming”, International Journal of Recent Trends in Engineering, 1(2), pp. 166-169.
[21] T. Sreeganesh (2006), “Telugu Parts of Speech Tagging in WSD”, Language in India, Vol 6: 8 August 2006.
[22] Avinesh PVS and Karthik Gali (2007), “Part-of-speech tagging and chunking using conditional random fields and transformation based learning”, Proceedings of the IJCAI and the Workshop On Shallow Parsing for South Asian Languages (SPSAL), pp. 21–24.
[23] Rama Sree R.J, Kusuma Kumari P (2007), “Combining POS Taggers for improved Accuracy to create Telugu annotated texts for Information Retrieval”, Tirupati.
[24] G. Sindhiya Binulal, P. Anand Goud, K.P. Soman (2009), “A SVM based approach to Telugu Parts Of Speech Tagging using SVMTool”, International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009.
[25] Chirag Patel and Karthik Gali (2008), “Part-Of-Speech Tagging for Gujarati Using Conditional Random Fields”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, pp. 117–122.
[26] Manju K, Soumya S, Sumam Mary Idicula (2009), “Development of A Pos Tagger for Malayalam-An Experience”, Proceedings of 2009 International Conference on Advances in Recent Technologies in Communication and Computing, IEEE.
[27] Antony P.J, Santhanu P Mohan, Soman K.P (2010), “SVM Based Part of Speech Tagger for Malayalam”, Proceedings of 2010 International Conference on Recent Trends in Information, Telecommunication and Computing, IEEE.
[28] Thoudam Doren Singh, Sivaji Bandyopadhyay (2008), “Morphology Driven Manipuri POS Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, pp. 91–98.
[29] Thoudam Doren Singh, Sivaji Bandyopadhyay (2008), “Manipuri POS Tagging using CRF and SVM: A Language Independent Approach”, Proceedings of ICON-2008: 6th International Conference on Natural Language Processing.
[30] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma, Jugal Kalita (2009), “Part of Speech Tagger for Assamese Text”, Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, pp. 33–36.