International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print),
ISSN 0976 – 6375 (Online), Volume 1, Number 1, May - June (2010), pp. 250-260
© IAEME, http://www.iaeme.com/ijcet.html
CURRENT STATE OF THE ART POS TAGGING FOR
INDIAN LANGUAGES – A STUDY
Shambhavi. B. R
Department of CSE, R V College of Engineering
Bangalore, E-Mail: shambhavibr@rvce.edu.in
Dr. Ramakanth Kumar P
Department of ISE, R V College of Engineering
Bangalore, E-Mail: ramakanthkp@rvce.edu.in
ABSTRACT
Part-of-speech (POS) tagging is the basic building block of any Natural
Language Processing (NLP) tool, and a POS tagger has many applications. For Indian
languages in particular, POS tagging takes on additional dimensions, as most of them are
agglutinative, morphologically very rich, highly inflected and sometimes diglossic.
Taggers have been developed using linguistic rules, stochastic models or both. This paper
is a survey of the POS taggers developed in the recent past for eight Indian languages,
namely Hindi, Bengali, Tamil, Telugu, Gujarati, Malayalam, Manipuri and Assamese.
Keywords: Part-of-speech tagger, Indian languages, agglutinative
I. INTRODUCTION
India is a large multi-lingual country of diverse culture. It has many languages
with written forms and over a thousand spoken languages. The Constitution of India
recognizes 22 languages, spoken in different parts of the country. The languages can be
categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These
families have some important differences: their ways of forming words and their
grammars are different. But both include a lot of Sanskrit words. In addition, both
have a similar construction and phraseology that link them closely together.
There is a need to develop information processing tools that facilitate human
machine interaction in Indian languages, as well as multi-lingual knowledge resources. A POS
tagger forms an integral part of any such processing tool. POS tagging
involves selecting the most likely sequence of syntactic categories for the words in a
sentence. The tagger facilitates the process of creating an annotated corpus. Annotated
corpora find their major use in NLP applications such as Text to Speech
Conversion, Speech Recognition, Word Sense Disambiguation, Machine Translation
and Information Retrieval.
II. TECHNIQUES FOR POS TAGGING
There exist different approaches to POS tagging. The tagging models can be
classified into unsupervised and supervised techniques, which differ in
the degree of automation of the training and the tagging process. Unsupervised POS
tagging models do not require a previously annotated corpus. Instead, they use advanced
computational techniques to automatically induce tagsets, transformation rules, etc.
Based on this information, they either calculate the probabilistic information needed by
stochastic taggers or induce the contextual rules needed by rule based or
transformation based systems. Supervised POS tagging models require a pre-annotated
corpus, which is used in training to learn information about the tagset, word-tag
frequencies, tag sequence probabilities and/or rule sets. Various
taggers exist based on these models. Both the supervised and unsupervised taggers can
be further classified into the following types.
Figure 1. Various techniques for POS tagging
[Tree diagram: POS tagging divides into unsupervised and supervised techniques, and each
of these into rule based, stochastic and neural taggers. Leaf nodes named in the figure:
Baum-Welch, Brill, HMM, maximum likelihood, decision trees, n-grams, CRF, SVM and the
Viterbi algorithm.]
A. Rule based Tagger
Rule-based taggers use rules, which can be hand-coded or derived from data such as a
tagged corpus. The rules are based on experience and help to resolve tag ambiguity.
For example, the Brill tagger is a rule based tagging system. It includes lexical rules, used
for initialisation, and contextual rules, used to correct the tags.
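
The two kinds of rules described above can be sketched as follows. This is a minimal illustration only, not the Brill tagger itself: the lexicon, the tag names, the English example words and the single contextual rule are all invented for the example, and a real system would learn or hand-craft many more rules.

```python
# Toy two-step rule based tagger: lexical initialisation, then contextual correction.
# LEXICON, CONTEXT_RULES and the tag names are hypothetical, chosen for illustration.

LEXICON = {"the": "DET", "can": "MD", "rusted": "VBD", "old": "ADJ"}

# Each contextual rule: (current tag, replacement tag, required previous tag)
CONTEXT_RULES = [
    ("MD", "NOUN", "DET"),  # a modal directly after a determiner is retagged as a noun
]

def initial_tags(words):
    """Lexical step: look each word up, falling back to NOUN for unknown words."""
    return [LEXICON.get(w.lower(), "NOUN") for w in words]

def apply_context_rules(tags):
    """Contextual step: correct initial tags using the tag of the previous word."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in CONTEXT_RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def rule_based_tag(words):
    return list(zip(words, apply_context_rules(initial_tags(words))))

print(rule_based_tag(["The", "can", "rusted"]))
```

Here "can" is first tagged as a modal by the lexicon and then corrected to a noun because it follows a determiner.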
B. Stochastic Tagger
Stochastic taggers use statistics, i.e., frequency or probability, to tag the input text.
The simplest stochastic taggers resolve ambiguity of words based on the probability that
a word occurs with a particular tag. The tag encountered most frequently in the training
set is the one assigned to an ambiguous instance of that word in the testing data. The
disadvantage of this approach is that, although it may yield a valid tag for each word, it
can also yield inadmissible sequences of tags. The alternative to the word frequency
approach is to calculate the probability of a given sequence of tags occurring. This is
referred to as the n-gram approach, referring to the fact that the best tag for a given word
is determined by the probability that it occurs with the n-1 previous tags. Stochastic
taggers are built on models such as the Hidden Markov Model (HMM), Maximum
Likelihood Estimation, Decision Trees, n-grams, Maximum Entropy, Support Vector
Machines or Conditional Random Fields.
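
The contrast between the word frequency approach and the n-gram approach can be illustrated with a small sketch. The training sentences, the English words, the tag names, the add-one smoothing and the NOUN fallback for unknown words are all assumptions made for this example; none of them is taken from the surveyed systems. The n-gram tagger here is a bigram HMM (n = 2) decoded with the Viterbi algorithm.

```python
from collections import Counter, defaultdict
import math

# Tiny hand-made training set (hypothetical data, for illustration only).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("bark", "NOUN"), ("fell", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
    [("trees", "NOUN"), ("bark", "VERB")],
]

word_tag = defaultdict(Counter)    # word -> counts of tags seen with it
tag_bigram = defaultdict(Counter)  # previous tag -> counts of following tags
tag_count = Counter()
for sent in TRAIN:
    prev = "<s>"                   # sentence-start marker
    for word, tag in sent:
        word_tag[word][tag] += 1
        tag_bigram[prev][tag] += 1
        tag_count[tag] += 1
        prev = tag

TAGS = sorted(tag_count)
VOCAB = len(word_tag)

def most_frequent_tag(words):
    """Word frequency approach: each word gets its most common training tag."""
    return [word_tag[w].most_common(1)[0][0] if w in word_tag else "NOUN"
            for w in words]

def log_p_trans(prev, tag):
    """Add-one smoothed log P(tag | previous tag)."""
    counts = tag_bigram.get(prev, Counter())
    return math.log((counts[tag] + 1) / (sum(counts.values()) + len(TAGS)))

def log_p_emit(tag, word):
    """Add-one smoothed log P(word | tag), with one extra slot for unknown words."""
    count = word_tag.get(word, Counter())[tag]
    return math.log((count + 1) / (tag_count[tag] + VOCAB + 1))

def viterbi(words):
    """Bigram approach: find the most probable tag sequence for a non-empty sentence."""
    # Each column maps tag -> (best log probability of a path ending here, backpointer).
    best = [{t: (log_p_trans("<s>", t) + log_p_emit(t, words[0]), None) for t in TAGS}]
    for w in words[1:]:
        col = {}
        for t in TAGS:
            score, prev = max((best[-1][p][0] + log_p_trans(p, t), p) for p in TAGS)
            col[t] = (score + log_p_emit(t, w), prev)
        best.append(col)
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):  # follow backpointers from the last word
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

print(most_frequent_tag(["the", "bark"]))  # per-word choice ignores the context
print(viterbi(["the", "bark"]))            # sequence model uses the previous tag
```

On this toy data the word frequency tagger labels "bark" as a verb even after a determiner, because that is its most common tag, while the bigram model prefers the noun reading in that context.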
C. Neural Tagger
Neural taggers are based on neural networks, which learn the parameters of the POS
tagger from a representative training data set [1]. Their performance has been reported
to be better than that of stochastic taggers.
III. CURRENT WORK IN INDIAN LANGUAGES
There has been extensive work towards building POS taggers for languages
across the world. Western languages have annotated corpora in abundance and hence all
machine learning techniques have been tried. The accuracy of these taggers
ranges approximately from 93% to 98%. But tagging of Indian languages is a very
challenging task. The primary reasons for this are the limited availability of annotated
corpora and the morphological richness of Indian languages. This section details the work
carried out in various universities and research centres in India in this regard.
A. Hindi
In recent years, there has been a lot of work towards building a POS tagger for
Hindi, the official language of India. Early work started with the development of a partial
POS tagger by Ray et al. [2]. This was followed by the work of Shrivastava et al., who
proposed harnessing morphological characteristics of Hindi for POS tagging [3]. This
was further enhanced in [4], which suggests a methodology that makes use of detailed
morphological analysis and lexicon lookup for tagging. It used an annotated corpus of
around 15,000 words collected from the BBC news site and a decision tree based learning
algorithm, CN2. The accuracy was 93.45% with a tagset of 23 POS tags.
The International Institute of Information Technology (IIIT), Hyderabad, initiated a
POS tagging and chunking contest, NLPAI ML, for Indian languages in 2006. Several
teams came up with various approaches for tagging in three Indian languages, namely
Hindi, Bengali and Telugu. In this contest, CRFs were first applied to Hindi by Ravindran
et al. [5] and Himanshu et al. [6] for POS tagging and chunking, where they reported a
performance of 89.69% and 90.89% respectively. In the work of Sankaran Baskaran [7],
an HMM based statistical technique was attempted. Here probability models of certain
contextual features were also used. A POS tagger for Hindi based on the Maximum
Entropy Markov Model was developed by Aniket Dalal et al. [8]. In this system, the main
POS tagging features used were context based features, dictionary features, word
features, and corpus-based features.
In 2007, as part of the SPSAL workshop at IJCAI-07, IIIT, Hyderabad conducted
a competition on POS tagging and chunking for the South Asian languages Hindi, Bengali
and Telugu. None of the teams tried the rule based approach. All eight participants tried
a wide range of learning techniques such as HMM, Decision Trees, CRF, Naïve Bayes and
the Maximum Entropy Model. The average POS tagging accuracies of all the systems for
Hindi, Bengali and Telugu were 73.93%, 72.35% and 71.83% respectively.
Table I. Summary of the approaches followed and accuracies (%) obtained by the
participating teams of the SPSAL workshop

Team                   Approach used    Hindi    Bengali   Telugu
Pattabhi et al         HMM              76.34    72.12     53.17
Satish and Kishore     HMM              69.35    60.08     77.20
Rao and Yarowsky       HMM              73.90    69.07     72.38
Himanshu               CRF              62.35    76        77.16
Asif et al             Hybrid HMM       76.87    73.17     67.69
Sandipan               CRF              75.69    77.61     74.47
Ravi et al             Max. Entropy     78.35    74.58     75.27
Avinesh and Karthik    Decision Trees   78.66    76.08     77.37
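
As a quick consistency check, the per-language averages quoted for the SPSAL systems can be recomputed from the figures in Table I. The sketch below assumes the averages are plain arithmetic means over the eight teams; the variable and function names are chosen for this example. Under that assumption the recomputed means agree with the quoted 73.93%, 72.35% and 71.83% to within about 0.02 percentage points.

```python
# Accuracies (%) copied from Table I, in the order (Hindi, Bengali, Telugu).
LANGS = ("Hindi", "Bengali", "Telugu")
TABLE_I = {
    "Pattabhi et al": (76.34, 72.12, 53.17),
    "Satish and Kishore": (69.35, 60.08, 77.20),
    "Rao and Yarowsky": (73.90, 69.07, 72.38),
    "Himanshu": (62.35, 76.0, 77.16),
    "Asif et al": (76.87, 73.17, 67.69),
    "Sandipan": (75.69, 77.61, 74.47),
    "Ravi et al": (78.35, 74.58, 75.27),
    "Avinesh and Karthik": (78.66, 76.08, 77.37),
}
# Averages as quoted in the text.
REPORTED_MEAN = {"Hindi": 73.93, "Bengali": 72.35, "Telugu": 71.83}

def column_means(table):
    """Arithmetic mean of each language column over all teams."""
    n = len(table)
    return {lang: sum(row[i] for row in table.values()) / n
            for i, lang in enumerate(LANGS)}

def best_team(table):
    """Team with the highest accuracy for each language."""
    return {lang: max(table, key=lambda team: table[team][i])
            for i, lang in enumerate(LANGS)}

means = column_means(TABLE_I)
for lang in LANGS:
    print(f"{lang}: recomputed {means[lang]:.2f}, reported {REPORTED_MEAN[lang]:.2f}")
print(best_team(TABLE_I))
```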
Manish Shrivastava and Pushpak Bhattacharyya [9] designed a simple POS tagger
for Hindi based on HMM. It utilized the morphological richness of the language without
resorting to complex and expensive analysis. It achieved a good accuracy of 93.12%.
More recent work in this area is that of Ankur Parikh [10], where neural networks are
tried for tagging. This multi-neuro tagger deals with sparse data, manages multiple
contexts, takes less training time and has an accuracy comparable to other traditional
tagging approaches for Indian languages.
B. Bengali
Bengali is an eastern Indo-Aryan language. It is ranked the sixth most spoken
language of the world. Almost all approaches to tagging have been tried on
Bengali text. Participants at the NLPAI Contest 2006 and SPSAL 2007 tried tagging for
Bengali along with Hindi and Telugu. The highest accuracies obtained for Bengali in the
two contests were 84.34% and 77.61% respectively. An HMM based tagger is reported in [11].
A Maximum Entropy based tagger was built in [12]. This tagger demonstrated an accuracy
of 88.2% on a test set of 20,000 word forms. CRF and SVM based taggers are reported in
[13] and [14] respectively. The SVM tagger used 26 tags and had a performance of 86.84%.
Recently Ekbal et al. applied a voted approach [15] in order to obtain the best results in
Bengali tagging.
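
The general idea of combining several taggers by voting can be sketched as below. This is plain majority voting with a simple tie-break, given only as an illustration; the exact voting scheme used in [15] is not described here and may differ (for example, by weighting the taggers). The function name and the example tags are hypothetical.

```python
from collections import Counter

def vote(predictions):
    """Combine tag sequences from several taggers by per-word majority vote.

    predictions: list of tag sequences, one per tagger, all of the same length.
    Ties are broken in favour of the tagger listed first.
    """
    voted = []
    for tags_at_position in zip(*predictions):
        counts = Counter(tags_at_position)
        top = max(counts.values())
        # First tag, in tagger order, that reaches the highest vote count.
        voted.append(next(t for t in tags_at_position if counts[t] == top))
    return voted

# Outputs of three hypothetical taggers for the same three-word sentence.
hmm_out = ["N", "V", "N"]
crf_out = ["N", "N", "N"]
svm_out = ["ADJ", "V", "V"]
print(vote([hmm_out, crf_out, svm_out]))
```

Listing the most reliable tagger first makes the tie-break favour it, which is one simple design choice among several possible ones.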
C. Tamil
Tamil is the Dravidian language for which a comparatively large amount of good work
has been done in the field of POS tagging. A tagger by Vasu Ranganathan, named tag tamil,
is based on a lexical phonological approach. It handles the morphotactics of the
morphological processing of verbs by using an index method. Ganesan's POS tagger [16]
works on the CIIL corpus. Its tagset includes 82 tags at the morph level and 22 at the word level.
Kathambam is a heuristic rule based tagger designed at RCILTS-Tamil. The performance
of the tagger is around 80%. It is based on the bigram model. In [17] a hybrid tagger
using rule based and HMM techniques is developed. SVMTool was used to tag the corpus
in [18] and an accuracy of 94.12% was obtained. Lakshmana Pandian and Geetha [19]
experimented with a morpheme based tagger. A naive Bayes probabilistic model using
morphemes forms the first stage, for preliminary POS tagging, and a CRF model forms the next
stage, to resolve the conflicts that arise in the first stage. The overall accuracy of the
tagger was 95.92%. Dhanalakshmi et al. [20] used an SVM methodology based on linear
programming. This gave an accuracy of 95.63% on the test data.
D. Telugu
Telugu is the third most-spoken language in India (with about 74 million native
speakers). It is the official language of Andhra Pradesh. In 2006, Sreeganesh [21]
implemented a rule based POS tagger. In the initial stage, a Telugu morphological
analyzer analyses the input text. A tagset is then added to its output, and finally around 524
formulated morpho-syntactic rules do the disambiguation. During the NLPAI Contest 2006, a
POS tagger with an accuracy of 81.59% was built. In the SPSAL 2007 workshop of IJCAI-07, the
best Telugu tagger was proposed by Avinesh et al. [22], with a performance of 77.37%.
In [23], three Telugu taggers, namely (i) a rule-based tagger, (ii) a Brill tagger and (iii) a
Maximum Entropy tagger, were developed with accuracies of 98.016%, 92.146% and
87.81% respectively. Recent work has been by Sindhiya Binulal et al. [24], who applied
SVMTool to tagging. The tagset included 10 tags and an accuracy of around 95% was
obtained.
E. Gujarati
Gujarati is a less privileged language with respect to available resources and
manually tagged data. As a very first step towards tagging, Chirag Patel and Karthik Gali
[25] have designed a hybrid model. The linguistic rules specific to Gujarati are converted
into features and provided to a CRF, in order to take advantage of both the statistical and
the rule based approaches. An accuracy of 92% has been achieved by this approach.
F. Malayalam
Malayalam is primarily spoken in southern coastal India by over 36 million
speakers. It is one of the Dravidian languages where much work is still to be done. Manju
K et al. [26] experimented with the stochastic approach for tagging of Malayalam words.
In the first step, a morphological analyzer is used to generate a tagged corpus, which is
later used by the HMM based tagger. The results obtained were promising. Later
work was by Antony P.J et al. [27], who applied the SVM approach to tag words. They
identified the ambiguities in Malayalam lexical items and developed a tagset of 29 tags.
The result was more accurate than that of the earlier work. With the increase in the number
of words in the training set, the performance increased to around 94%.
G. Manipuri
Manipuri is the official language of Manipur. There are at least 29
different dialects spoken in Manipur. Manipuri tagging is dependent on the
morphological analysis and lexical rules of each category. Hence Thoudam Doren Singh
and Sivaji Bandyopadhyay initially tried to build a morphology driven tagger [28]. This
showed an accuracy of only 69%. Later they built taggers [29] using Conditional
Random Fields (CRF) and Support Vector Machines (SVM). The tagset consisted of 26
tags. Evaluation results demonstrated an improvement in accuracy: they obtained
72.04% with the CRF and 74.38% with the SVM.
H. Assamese
Assamese is a morphologically rich and agglutinative language with relatively free word
order, like other Indian languages. Navanath Saharia et al. [30] built an Assamese
tagger using the HMM model with the Viterbi algorithm. An accuracy of 87% was achieved
by the tagger on the test inputs. Pallav Kumar Dutta has attempted to develop an online
semi-automated tagger, designed to deal with the sparse data problem of the
language. NLTK is used to tag the test data, and for ambiguous tags an online tagger
helps the user to change the tags.
IV. CONCLUSION
Development of a high accuracy POS tagger is an active research area in NLP.
The bottleneck in POS tagging of Indian languages is the non-availability of lexical
resources. In addition, adoption of a common tagset by researchers would facilitate
reusability and interoperability of annotated corpora. We have presented in this paper a
detailed study of the POS taggers developed for eight Indian languages. But there exist other
languages of the country for which hardly any attempts towards building a POS tagger
have started.
REFERENCES
[1] Ahmed (2002), “Application of multilayer perceptron network for tagging parts-of-
speech”, Proceedings of the Language Engineering Conference, IEEE.
[2] P. R. Ray, V. Harish, S. Sarkar and A. Basu (2003), “Part of speech tagging and local
word grouping techniques for natural language parsing in Hindi”, Proceedings of the
International Conference on Natural Language Processing (ICON 2003).
[3] S. Singh M. Shrivastava, N. Agrawal and P. Bhattacharya (2005), “Harnessing
morphological analysis in pos tagging task”, Proceedings of the International Conference
on Natural Language Processing (ICON 2005).
[4] Smriti Singh, Kuhoo Gupta, Manish Shrivastava, and Pushpak Bhattacharyya
(2006), “Morphological richness offsets resource demand – experiences in constructing a
pos tagger for Hindi”, Proceedings of the COLING/ACL 2006 Main Conference Poster
Sessions, Sydney, Australia, pp. 779–786.
[5] Pranjal Awasthi, Delip Rao, Balaraman Ravindran (2006), “Part Of Speech Tagging
and Chunking with HMM and CRF”, Proceedings of the NLPAI MLcontest workshop,
National Workshop on Artificial Intelligence.
[6] Himanshu Agrawal, Anirudh Mani (2006), “Part Of Speech Tagging and Chunking
Using Conditional Random Fields” Proceedings of the NLPAI MLcontest workshop,
National Workshop on Artificial Intelligence.
[7] Sankaran Baskaran (2006), “Hindi POS tagging and Chunking”, Proceedings of the
NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[8] Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke (2006), “Hindi Part-of-
Speech Tagging and Chunking: A Maximum Entropy Approach” Proceedings of the
NLPAI MLcontest workshop, National Workshop on Artificial Intelligence.
[9] Manish Shrivastava, Pushpak Bhattacharyya (2008), “Hindi POS Tagger Using Naive
Stemming: Harnessing Morphological Information Without Extensive Linguistic
Knowledge”, Proceedings of ICON-2008: 6th International Conference on Natural
Language Processing.
[10] Ankur Parikh (2009), “Part-Of-Speech Tagging using Neural network”, Proceedings
of ICON-2009: 7th International Conference on Natural Language Processing.
[11] Ekbal, Asif, Mondal, S., and S. Bandyopadhyay (2007) “POS Tagging using HMM
and Rule-based Chunking”, In Proceedings of SPSAL-2007, IJCAI-07, pp. 25-28.
[12] A. Ekbal, R. Haque and S. Bandyopadhyay (2008), “Maximum Entropy Based
Bengali Part of Speech Tagging”, Advances in Natural Language Processing and
Applications, Research in Computing Science (RCS) Journal, Vol. (33), pp. 67-78.
[13] A. Ekbal, R. Haque and S. Bandyopadhyay (2007), “Bengali Part of Speech Tagging
using Conditional Random Field”, Proceedings of the 7th International Symposium on
Natural Language Processing (SNLP-07), Thailand, pp.131-136.
[14] A. Ekbal and S. Bandyopadhyay (2008), “Part of Speech Tagging in Bengali using
Support Vector Machine”, Proceedings of the International Conference on Information
Technology (ICIT 2008), pp.106-111, IEEE.
[15] A. Ekbal, M. Hasanuzzaman and S. Bandyopadhyay (2009), “Voted Approach for
Part of Speech Tagging in Bengali”, Proceedings of the 23rd Pacific Asia Conference on
Language, Information and Computation (PACLIC-09), December 3-5, Hong Kong,
pp. 120-129.
[16] Ganesan M (2007), “Morph and POS Tagger for Tamil” (Software) Annamalai
University, Annamalai Nagar.
[17] Arulmozhi P, Sobha L (2006) “A Hybrid POS Tagger for a Relatively Free Word
Order Language”, Proceedings of MSPIL-2006, Indian Institute of Technology, Bombay.
[18] Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P,
Rajendran S (2008), “Tamil Part-of-Speech tagger based on SVMTool”, Proceedings of
the COLIPS International Conference on Asian Language Processing 2008 (IALP),
Chiang Mai, Thailand.
[19] S. Lakshmana Pandian and T. V. Geetha (2008), “Morpheme based Language Model
for Tamil Part-of-Speech Tagging”, Research journal on Computer science and computer
engineering with applications, July-Dec 2008, pp. 19-25.
[20] Dhanalakshmi V, Anandkumar M, Shivapratap G, Soman K.P, Rajendran S (2009),
“Tamil POS Tagging using Linear Programming”, International Journal of Recent Trends
in Engineering, 1(2), pp. 166-169.
[21] T. Sreeganesh (2006), “Telugu Parts of Speech Tagging in WSD”, Language in
India, Vol 6: 8 August 2006.
[22] Avinesh PVS and Karthik Gali (2007), “Part-of-speech tagging and chunking using
conditional random fields and transformation based learning”, Proceedings of the IJCAI
and the Workshop On Shallow Parsing for South Asian Languages (SPSAL), pp. 21–24.
[23] Rama Sree, R.J, Kusuma Kumari P (2007), “Combining POS Taggers for improved
Accuracy to create Telugu annotated texts for Information Retrieval”, Tirupati.
[24] G. Sindhiya Binulal, P. Anand Goud, K.P. Soman (2009), “A SVM based approach to
Telugu Parts Of Speech Tagging using SVMTool”, International Journal of Recent
Trends in Engineering, Vol. 1, No. 2, May 2009.
[25] Chirag Patel and Karthik Gali (2008), “Part-Of-Speech Tagging for Gujarati Using
Conditional Random Fields”, Proceedings of the IJCNLP-08 Workshop on NLP for Less
Privileged Languages, Hyderabad, India, pp. 117–122.
[26] Manju K, Soumya S, Sumam Mary Idicula (2009), “Development of A Pos Tagger
for Malayalam-An Experience”, Proceedings of 2009 International Conference on
Advances in Recent Technologies in Communication and Computing, IEEE.
[27] Antony P.J, Santhanu P Mohan, Soman K.P (2010), “SVM Based Part of Speech
Tagger for Malayalam”, Proceedings of 2010 International Conference on Recent Trends
in Information, Telecommunication and Computing, IEEE.
[28] Thoudam Doren Singh, Sivaji Bandyopadhyay (2008), “Morphology Driven
Manipuri POS Tagger”, Proceedings of the IJCNLP-08 Workshop on NLP for Less
Privileged Languages, Hyderabad, India, pp. 91–98.
[29] Thoudam Doren Singh, Sivaji Bandyopadhyay (2008), “Manipuri POS Tagging
using CRF and SVM: A Language Independent Approach”, Proceedings of ICON-2008:
6th International Conference on Natural Language Processing.
[30] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma, Jugal Kalita (2009), “Part of
Speech Tagger for Assamese Text”, Proceedings of the ACL-IJCNLP 2009 Conference
Short Papers, Suntec, Singapore, pp. 33–36.