Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks such as machine translation, text-to-speech synthesis, and natural language understanding. It seeks to classify words that represent names in text into predefined categories such as location, person name, organization, date, and time. This paper introduces a hybrid approach to NER that combines machine learning and rule-based techniques. We experiment with statistical approaches, Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt), and with a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION (ijnlc)
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying them into predefined categories such as person name, location, organization, date, and time. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naive Bayes (MNB) classifier. The methodology adopted in this paper is based on extracting features from the training corpus using term frequency and inverse document frequency and fitting them with a TF-IDF vectorizer. The paper discusses the various issues in developing the proposed model, along with the details of implementation and performance evaluation. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens. The model achieves Precision, Recall and F1-measure of 83%, 79% and 81% respectively.
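A rough sketch of this methodology (not the authors' code; the Kannada context windows and labels below are invented for illustration) using scikit-learn's TfidfVectorizer and MultinomialNB:

# Minimal sketch of TF-IDF + Multinomial Naive Bayes for NERC, assuming
# each token is classified via a small window of surrounding context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: context windows around tokens, with NE labels.
windows = ["bengaluru nagara dalli", "ramu avaru helidaru", "january tingalu"]
labels  = ["LOCATION", "PERSON", "DATE"]

# Character n-grams are one plausible choice for a morphologically rich language.
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
                      MultinomialNB())
model.fit(windows, labels)
print(model.predict(["mysuru nagara"]))  # shared context suggests LOCATION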
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names, and miscellaneous (date and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports on the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of the different contextual information of the words along with a variety of features that are helpful in predicting the four named entity (NE) classes: person name, location name, organization name, and miscellaneous (date and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
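To make the feature design concrete, here is a minimal sketch using the third-party sklearn-crfsuite package; the feature set and the tagged sentence are illustrative stand-ins, not the paper's exact configuration:

# Minimal CRF sketch with word-level contextual features (illustrative only).
import sklearn_crfsuite

def word2features(sent, i):
    word = sent[i][0]
    feats = {"word": word, "suffix3": word[-3:], "is_digit": word.isdigit()}
    # Contextual information: previous and next words, when present.
    if i > 0:
        feats["prev_word"] = sent[i - 1][0]
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1][0]
    return feats

# Hypothetical tagged sentence: (token, NE label) pairs.
train = [[("hyderabad", "LOC"), ("lo", "O"), ("ravi", "PER")]]
X = [[word2features(s, i) for i in range(len(s))] for s in train]
y = [[label for _, label in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)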
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION (ijnlc)
Information Extraction (IE) is a sub-discipline of Artificial Intelligence. IE identifies information in unstructured sources that adheres to predefined semantics, i.e., people, locations, etc. Recognition of named entities (NEs) from computer-readable natural language text is a significant task of IE and natural language processing (NLP). Named entity (NE) extraction is an important step in processing unstructured content. Unstructured data is computationally opaque, and computers require computationally transparent data for processing; IE adds meaning to raw data so that it can be easily processed by computers. Various approaches are applied for the extraction of entities from text. This paper elaborates the need for NE recognition for Marathi and discusses the issues and challenges involved in NE recognition tasks for the Marathi language. It also explores various methods and techniques that are useful for creating the learning resources and lexicons that are important for extracting NEs from unstructured natural language text.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (ijnlc)
Stemming is the process of term conflation: it conflates all variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, and information retrieval (IR). Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
Named Entity Recognition using Hidden Markov Model (HMM) (kevig)
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, and time. In this paper we describe in detail a Hidden Markov Model (HMM) based machine learning approach to identifying named entities. The main idea behind using an HMM to build an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed; they are dynamic, and one can define them according to the task at hand. The corpus used by our NER system is also not domain specific.
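To make the HMM approach concrete, a minimal Viterbi decoder is sketched below; the states, observations, and probabilities are invented for illustration, where a real system would estimate them from an annotated corpus:

# Minimal Viterbi sketch for HMM-based NER; probabilities are illustrative
# stand-ins for values that would be estimated from training data.
def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-6), p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state to recover the tag sequence.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = V[t][last][1]
        path.insert(0, last)
    return path

states = ["PER", "O"]
start_p = {"PER": 0.3, "O": 0.7}
trans_p = {"PER": {"PER": 0.4, "O": 0.6}, "O": {"PER": 0.2, "O": 0.8}}
emit_p = {"PER": {"ram": 0.8}, "O": {"said": 0.5}}
print(viterbi(["ram", "said"], states, start_p, trans_p, emit_p))  # ['PER', 'O']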
Script Identification of Text Words from a Tri-Lingual Document Using Voting ... (CSCJournals)
In a multi-script environment, a majority of documents may contain text printed in more than one script or language. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this context, this paper develops a model to identify and separate text words of Kannada, Hindi and English scripts from a printed tri-lingual document. The proposed method is trained to learn the distinct features of each script, and a binary tree classifier is used to classify the input text image. Experimentation involved 1500 text words for learning and 1200 text words for testing, carried out on both a manually created data set and a scanned data set. The results are very encouraging and prove the efficacy of the proposed model: the average success rate is 99% for the manually created data set and 98.5% for the data set constructed from scanned document images.
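As a loose illustration of the classification stage (a scikit-learn decision tree standing in for the paper's binary tree classifier, with invented two-value feature vectors per word image):

# Sketch of the classification stage: a decision tree over per-word features.
# The feature values (e.g. top-profile density, loop count) are hypothetical.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0.9, 2], [0.2, 5], [0.5, 1]]
y_train = ["Hindi", "Kannada", "English"]

clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.predict([[0.85, 2]]))  # falls in the region of the Hindi sample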
This paper presents a survey on Word Sense Disambiguation (WSD). In nearly all major languages around the world, research in WSD has been conducted to varying extents. We survey the different approaches adopted in different research works, the state of the art in performance in this domain, recent work in different Indian languages, and finally work in the Bengali language. We also survey the different competitions in this field and the benchmark results obtained from those competitions.
A New Approach to Parts of Speech Tagging in Malayalam (ijcsit)
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's usage in the sentence. Usually these tags mark syntactic classes like noun or verb, and they sometimes carry additional information such as case markers (number, gender, etc.) and tense markers. A large number of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the rule-based approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the oldest approach and relies on a lexicon or dictionary for reference. The stochastic approach uses probabilistic and statistical information to assign tags to words. It requires a large corpus, so its time and space complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic approach is the more widely used nowadays because of its accuracy.
Malayalam is a Dravidian language: it is inflectional, with suffixes attached to root word forms. The currently used algorithms are efficient machine learning algorithms, but they are not built for Malayalam, which hurts the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent-tag information, and the algorithm uses multithreading. Tagging is done with the probability of the occurrence of the sentence structure along with the dictionary entry.
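A toy sketch of this idea, choosing each word's tag by dictionary lookup scored against the adjacent (previous) tag; the dictionary and bigram probabilities are invented:

# Toy sketch: pick each word's tag by combining its dictionary entries with
# a bigram probability over adjacent tags; all data here is invented.
dictionary = {"avan": ["PRON"], "padikkunnu": ["VERB"],
              "pusthakam": ["NOUN", "VERB"]}
tag_bigram = {("PRON", "NOUN"): 0.6, ("PRON", "VERB"): 0.4,
              ("NOUN", "VERB"): 0.7, ("START", "PRON"): 0.9}

def tag_sentence(words):
    tags, prev = [], "START"
    for w in words:
        candidates = dictionary.get(w, ["NOUN"])  # default to NOUN if unknown
        best = max(candidates, key=lambda t: tag_bigram.get((prev, t), 0.01))
        tags.append(best)
        prev = best
    return tags

print(tag_sentence(["avan", "pusthakam", "padikkunnu"]))
# expected: ['PRON', 'NOUN', 'VERB']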
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor (Waqas Tariq)
In recent decades, speech-interactive systems have gained increasing importance. The performance of an ASR system depends mainly on the availability of a large speech corpus. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech, which requires a large speech corpus with sentence- or phoneme-level transcription of the utterances. The transcriptions must also cover different sound sequences so that the recognizer can build models for all the sounds present. But for Telugu, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or even millions of word forms. A significant part of the grammar that is handled by syntax in English (and similar languages) is handled within morphology in Telugu; phrases comprising several words (that is, tokens) in English map onto a single word in Telugu. Telugu is also phonetic in nature in addition to being rich in morphology, which is why speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in building a voice-enabled text editor with automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed for highly inflecting, morphologically rich languages. This method increases speech recognition accuracy with a large reduction in corpus size, and it adds Telugu words to the database dynamically, so the corpus grows over time.
Marathi-English CLIR using detailed user query and unsupervised corpus-based WSD (IJERA Editor)
With the rapid growth of multilingual information on the Internet, Cross Language Information Retrieval (CLIR) is becoming the need of the day. It lets users query in their native language and retrieve information in any language. But the performance of CLIR is poor compared to monolingual retrieval, due to lexical ambiguity, mismatches between query terms, and out-of-vocabulary words. In this paper, we propose an algorithm for improving the performance of a Marathi-English CLIR system. The system first finds possible translations of the input query in the target language, disambiguates them, and then gives the English queries to a search engine for relevant document retrieval. The disambiguation is based on an unsupervised corpus-based method that uses an English dictionary as an additional resource. The experiment is performed on the FIRE 2011 (Forum for Information Retrieval Evaluation) dataset using the "Title" and "Description" fields as inputs. The experimental results show that the proposed approach improves the performance of the Marathi-English CLIR system with a good precision level.
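In outline, such a pipeline translates each query term via a bilingual dictionary and keeps the candidate combination that coheres best in an English corpus. The sketch below is a simplified stand-in for the paper's method; the dictionary entries and co-occurrence scores are invented:

# Outline of dictionary-based query translation with corpus-based
# disambiguation; dictionary entries and co-occurrence scores are invented.
from itertools import product

bilingual_dict = {"nadi": ["river", "vein"], "kath": ["bank", "shore"]}

def cooccurrence(a, b):
    # Stand-in for a corpus statistic, e.g. co-occurrence counts in English text.
    scores = {("river", "bank"): 9, ("river", "shore"): 7,
              ("vein", "bank"): 1, ("vein", "shore"): 0}
    return scores.get((a, b), scores.get((b, a), 0))

def translate_query(marathi_terms):
    candidates = [bilingual_dict.get(t, [t]) for t in marathi_terms]
    # Choose the combination of translations maximizing pairwise cohesion.
    return max(product(*candidates),
               key=lambda c: sum(cooccurrence(a, b)
                                 for a in c for b in c if a != b))

print(translate_query(["nadi", "kath"]))  # expected: ('river', 'bank')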
Design and Development of a Malayalam to English Translator- A Transfer Based... (Waqas Tariq)
This paper describes a transfer-based scheme for translating Malayalam, a Dravidian language, to English. The system takes Malayalam sentences as input and outputs equivalent English sentences. It comprises a preprocessor for splitting compound words, a morphological parser for context disambiguation and chunking, a syntactic structure transfer module, and a bilingual dictionary. All the modules are morpheme based to reduce dictionary size. The system does not rely on a stochastic approach; it is based on a rule-based architecture along with various linguistic knowledge components of both Malayalam and English. The system uses two sets of rules: rules for Malayalam morphology and rules for syntactic structure transfer from Malayalam to English. It is designed using artificial intelligence techniques.
A New Concept Extraction Method for Ontology Construction From Arabic Text (CSCJournals)
Ontology is one of the most popular representation models used for knowledge representation, sharing, and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects, and due to this complexity, automatic Arabic terminology extraction is difficult. In addition, concept extraction from Arabic documents is a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single-word and multi-word concepts has been proposed, implemented, and evaluated. Our method combines linguistic information, statistical information, and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter removes unwanted strings. To determine the relevance of these candidates within the domain, different statistical measures and a new domain relevance measure are implemented for the first time for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module, and concept scores are calculated from their statistical values and domain knowledge values. To evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach for extracting concepts for Arabic ontology construction.
Wavelet Packet Based Features for Automatic Script Identification (CSCJournals)
In a multi-script environment, archives of documents with text regions printed in different scripts are common. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of documents printed in seven scripts, in order to categorize them for further processing. South Indian documents printed in seven scripts, Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English, are considered here. The document images are decomposed through Wavelet Packet Decomposition using the Haar basis function up to level two. Texture features are extracted from the sub-bands of the wavelet packet decomposition: the Shannon entropy value is computed for each sub-band, and these entropy values are combined to form the texture features. Experimentation involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier, and the average success rate is 99.68%.
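The described feature pipeline can be sketched with the PyWavelets package: a level-two Haar wavelet packet decomposition, Shannon entropy per sub-band, and a K-nearest neighbor classifier. This is an illustration of the stated design, not the authors' code, and the images are random stand-ins:

# Sketch of the feature pipeline: level-2 Haar wavelet packet decomposition,
# Shannon entropy per sub-band, K-NN classification. Illustrative only.
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier

def entropy_features(image):
    wp = pywt.WaveletPacket2D(data=image, wavelet="haar", maxlevel=2)
    feats = []
    for node in wp.get_level(2):
        coeffs = np.abs(node.data.ravel()) + 1e-12
        p = coeffs / coeffs.sum()
        feats.append(-np.sum(p * np.log2(p)))  # Shannon entropy of the sub-band
    return feats

# Hypothetical grayscale text-block images and script labels.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(4)]
labels = ["Kannada", "Tamil", "Hindi", "English"]

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit([entropy_features(im) for im in images], labels)
print(knn.predict([entropy_features(images[0])]))  # expected: Kannada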
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing requirement on the Internet for users to access information expressed in languages other than their own has established CLIR as a major topic in IR.
Automatic classification of Bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to automatically classify Bengali sentences into different groups according to their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis we used the Naive Bayes probabilistic model as a classifier of sentences. We applied the algorithm over 1747 sentences that contain a particular Bengali lexical item which, because of its ambiguous nature, can trigger different senses and thus give the sentences different meanings. In our experiment we achieved around 84% accuracy in sense classification over the total input sentences. We analyzed the residual sentences that did not comply with our experiment and affected the results, noting that in many cases wrong syntactic structures and sparse semantic information are the main hurdles in the semantic classification of sentences. The applicational relevance of this study is attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
A Review on the Cross and Multilingual Information Retrieval (dannyijwest)
In this paper we explore some of the most important areas of information retrieval, in particular Cross-lingual Information Retrieval (CLIR) and Multilingual Information Retrieval (MLIR). CLIR deals with asking questions in one language and retrieving documents in a different language. MLIR deals with asking questions in one or more languages and retrieving documents in one or more different languages. With an increasingly globalized economy, the ability to find information in other languages is becoming a necessity. We also present the evaluation initiatives of the information retrieval domain. Finally, we present an overall review of the research works in Indian and foreign languages.
Survey on Indian CLIR and MT systems in Marathi Language (Editor IJCATR)
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from the language of the user's query. This helps users express their information need in their native languages. The machine translation based (MT-based) approach to CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the research work done on CLIR and MT systems for the Marathi language in India.
An Improved Approach for Word Ambiguity Removal (Waqas Tariq)
Word ambiguity removal is the task of removing ambiguity from a word, i.e., identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a Part-of-Speech tagger and three categories for word sense disambiguation (WSD). Such disambiguation is much needed in Human-Computer Interaction, to improve interactions between users and computers. For this, supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the best-suited domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word sense disambiguation
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... (Syeful Islam)
Hundreds of millions of people of almost all levels of education and backgrounds, from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and there is a huge number of different naming entities in Bangla, so it is difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible with a minimal number of words stored in the dictionary.
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach (IJERA Editor)
Language is an effective medium of communication that conveys the ideas and expressions of the human mind. There are more than 5000 languages in the world, and knowing all of them is not a solution to the language barrier in communication. In this multilingual world, with huge amounts of information exchanged between regions in different languages in digitized form, it has become necessary to find an automated process to convert from one language to another. Natural Language Processing (NLP) is one of the active areas of research that explores how computers can be used to understand and manipulate natural language text or speech. In the proposed system, a hybrid approach to transliterating proper nouns from Punjabi to Hindi is developed. The hybrid approach in the proposed system is a combination of direct mapping, a rule-based approach, and the Statistical Machine Translation (SMT) approach. The proposed system is tested on proper nouns from different domains, and its accuracy is very good.
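The direct-mapping component can be sketched as a character table from Gurmukhi to Devanagari; only a few mappings are shown, and unmapped characters are assumed to fall through to the rule-based and SMT components:

# Minimal sketch of the direct-mapping step: Gurmukhi characters are mapped
# to their Devanagari counterparts; unmapped characters would fall through
# to the rule-based and SMT components (not shown here).
gurmukhi_to_devanagari = {
    "ਅ": "अ", "ਕ": "क", "ਮ": "म", "ਰ": "र", "ਲ": "ल", "ਾ": "ा", "ਿ": "ि",
}

def direct_map(word):
    return "".join(gurmukhi_to_devanagari.get(ch, ch) for ch in word)

print(direct_map("ਕਮਲ"))  # expected: कमल ("Kamal")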
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE (csandit)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once automatically extracted, they can be used to increase the performance of text mining applications such as categorization, clustering, information retrieval, machine translation, and summarization. This paper introduces our proposed multiword term extraction system based on contextual information. We propose a new method based on a hybrid approach for Arabic multiword term extraction. Like other hybrid methods, our method is composed of two main steps: a linguistic one and a statistical one. In the first step, the linguistic approach uses a Part-of-Speech (POS) tagger (Taani's tagger) and a sequence identifier as patterns in order to extract candidate AMWTs. In the second step, which includes our main contribution, the statistical approach incorporates contextual information by using a newly proposed association measure based on termhood and unithood for AMWT extraction. To evaluate the efficiency of our proposed method, it has been tested and compared using three different association measures: the proposed one, named NTC-Value, plus NC-Value and C-Value. Experimental results on Arabic texts from the environment domain show that our hybrid method outperforms the others in terms of precision; in addition, it deals correctly with tri-gram Arabic multiword terms.
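For reference, the baseline C-Value measure that NTC-Value is compared against can be sketched as follows; the candidate terms and frequencies are invented, substring containment stands in for proper nested-term detection, and the +1 inside the logarithm is a common variant that keeps single-word terms above zero:

# Sketch of the baseline C-Value measure: log2 of term length times frequency,
# discounted by the average frequency of longer terms containing the term.
import math

freq = {"solar energy": 12, "renewable solar energy": 4, "energy": 30}

def c_value(term):
    words = term.split()
    nests = [t for t in freq if term in t and t != term]  # crude containment
    weight = math.log2(len(words) + 1)  # +1 variant so unigrams score > 0
    if not nests:
        return weight * freq[term]
    discount = sum(freq[t] for t in nests) / len(nests)
    return weight * (freq[term] - discount)

print(round(c_value("solar energy"), 2))            # nested term, discounted
print(round(c_value("renewable solar energy"), 2))  # non-nested term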
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased enormously, and it is therefore essential to extract events from it intelligently. Progress in natural language processing (NLP) technologies for information extraction makes it possible to locate and classify content in news data according to predefined categories such as person name, place name, organization name, date, and time. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level performance on specific documents such as newspapers by effectively identifying entities. The purpose of this research is to introduce an NER system for Bengali news data that identifies specified entities in running text based on regular expressions and Bengali grammar. In doing so, I have designed and evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden Markov Model (HMM) based approach for developing an NER system from Bengali news data.
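The regular-expression component for date and number entities might look like the following sketch; the month list and patterns are simplified illustrations, not the thesis's actual grammar rules:

# Illustrative regular expressions for date/number named entities in Bengali
# text; the month list and patterns are simplified examples only.
import re

BN_DIGITS = "[\u09e6-\u09ef]"  # Bengali digits 0-9
MONTHS = "জানুয়ারি|ফেব্রুয়ারি|মার্চ"  # a few month names for illustration

date_re = re.compile(rf"{BN_DIGITS}{{1,2}}\s*(?:{MONTHS})\s*{BN_DIGITS}{{4}}")
number_re = re.compile(rf"{BN_DIGITS}+")

text = "ঢাকায় ১৫ মার্চ ২০২৪ তারিখে সভা হয়।"
print(date_re.findall(text))    # the full date expression
print(number_re.findall(text))  # all runs of Bengali numerals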
SURVEY ON MACHINE TRANSLITERATION AND MACHINE LEARNING MODELS (ijnlc)
Globalization and the growth of Internet users demand that almost all internet-based applications support local languages. Support for local languages can be provided in all internet-based applications by means of machine transliteration and machine translation. This paper provides a thorough survey of machine transliteration models and the machine learning approaches used for machine transliteration over a period of more than two decades, for internationally used languages as well as Indian languages. The survey shows that linguistic approaches provide better results for closely related languages, while probability-based statistical approaches are good when one of the languages is phonetic and the other is non-phonetic. Better accuracy can be achieved only by using hybrid and combined models.
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES (ijistjournal)
Named Entity Recognition is a prior task in Natural Language Processing. It is a subtask of information extraction that identifies and classifies proper nouns into predefined categories such as person, location, organization, time, and date. This document focuses on NER approaches and discusses the work done so far for various languages to identify named entities. The authors have done a comparative study of named entity recognition and found that the CRF approach has proven best for identifying named entities in Indian languages.
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS (ijnlc)
The evolution of information technology has led to the collection of large amounts of data, the volume of which has increased to the extent that in the last two years more data has been produced than in all of prior human history. This has necessitated the use of machines to understand, interpret, and apply data without manual involvement. A lot of this text is available in transliterated, code-mixed form, which is very difficult to analyze due to its complexity. Work in this area is progressing at a great pace, and this work hopes to push it further. The designed system classifies transliterated (Romanized) Hindi and Marathi text documents automatically, using supervised learning methods (KNN, Naive Bayes, and Support Vector Machine (SVM)) and ontology-based classification, and the results are compared in order to decide which methodology is better suited to handling these documents. As we will see, the plain machine learning approaches perform just as well as, and in many cases much better than, the more analytical approach.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Textskevig
Cyberbullying is currently one of the most important research fields. The majority of researchers have contributed to research on bully-text identification in English texts or comments; due to the scarcity of data, Tamil text stemming is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and the creation of a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix stripping technique, and a supervised machine learning algorithm was used to classify words as pronouns and proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and a lemmatization process to discover exact words from Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13,000 words.
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
Manipuri is both a minority and a morphologically rich language, with genetic features similar to Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order, agglutinative verb morphology and is monosyllabic; morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a useful research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), WordNet, Word Sense Disambiguation (WSD) etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, presenting a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.
A decision tree based word sense disambiguation system in manipuri languageacijjournal
This paper presents a first attempt at building a word sense disambiguation system for the Manipuri language. The paper discusses related attempts made for Manipuri, followed by the proposed plan. A database consisting of 650 sentences was collected in the Manipuri language in the course of the study.
Conventional positional and context based features are suggested to capture the sense of the words, which
have ambiguous and multiple senses. The proposed work is expected to predict the senses of the
polysemous words with high accuracy with the help of the suitable knowledge acquisition techniques. The
system produces an accuracy of 71.75 %.
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of the application. In this paper deep learning techniques are used to perform the NER task on Hindi text, since, compared to English, NER for Hindi has not been sufficiently explored; this is a barrier for resource-scarce languages, as many resources are not readily available. Researchers have used various techniques such as rule-based, machine learning based and hybrid approaches to solve this problem, and deep learning algorithms are now being developed at large scale for advanced NER models. In this paper we devise a novel architecture based on residual networks over a Bidirectional Long Short Term Memory (BiLSTM) with fasttext word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NE tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. We have run various experiments comparing the results of NER with normal embedding and fasttext embedding layers, analysing the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results, in terms of F1 score, with the said approach.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are among the most used techniques in the field of computer applications, and the field has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything is computerized, communication between computers and humans has become a necessity. To fulfill this necessity NLP has emerged as the means of interaction which narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was put through the Turing test to check similarity between data, but early work was limited to small data sets. Later, various algorithms were developed along with the concepts of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different NLP techniques developed so far, their applications, and a comparison of those techniques on different parameters.
Phrase identification is one of the most critical and widely studied tasks in Natural Language Processing (NLP). Verb phrase identification within a sentence is very useful for a variety of NLP applications, and one of the core enabling technologies they require is morphological analysis. This paper presents a Myanmar verb phrase identification and translation algorithm and develops a Markov model with morphological analysis. The system is based on a rule-based maximum matching approach. In machine translation, a large amount of information is needed to guide the translation process. Myanmar is an inflected language, and there are very few lexicon resources for Myanmar compared to other languages such as English, French and Czech. Therefore, this paper proposes a Myanmar verb phrase identification and translation model based on the syntactic structure and morphology of the Myanmar language, using a Myanmar-English bilingual lexicon. A Markov model is also used to reformulate the translation probability of phrase pairs. Experimental results show that the proposed system can improve translation quality by applying morphological analysis to the Myanmar language.
Improving performance of english hindi cross language information retrieval u...ijnlc
The main issue in Cross Language Information Retrieval (CLIR) is the poor performance of retrieval, in terms of average precision, when compared to monolingual retrieval. The main reasons behind the poor performance of CLIR are mismatching of query terms, lexical ambiguity and untranslated query terms. These problems need to be addressed in order to increase the performance of CLIR systems. In this paper, we attempt to solve them by proposing an algorithm for improving the performance of an English-Hindi CLIR system. We use all possible combinations of the Hindi translated query, obtained via transliteration of the English query terms, and choose the best query among them for document retrieval. The experiment is performed on the FIRE 2010 (Forum for Information Retrieval Evaluation) datasets. The experimental results show that the proposed approach improves the performance of the English-Hindi CLIR system, helps in overcoming the existing problems, and outperforms the existing English-Hindi CLIR system in terms of average precision.
Natural Language Toolkit (NLTK) is a generic platform to process data in various natural (human) languages, and it also provides resources for Indian languages like Hindi, Bangla, Marathi and so on. In the proposed work, the repositories provided by NLTK are used to carry out the processing of Hindi text and then the analysis of Multi-Word Expressions (MWEs). MWEs are lexical items that can be decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical idiomaticity. The main focus of this paper is on the processing and analysis of MWEs for Hindi text. The corpus used for Hindi text processing is taken from the famous Hindi novel “KaramaBhumi by Munshi PremChand”. The result analysis is done using the Hindi corpus provided by the Resource Centre for Indian Language Technology Solutions (CFILT). Results are analysed to justify the accuracy of the proposed work.
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...ijnlc
Machine Translation for Indian languages is an emerging research area. Transliteration is one such module that we design while designing a translation system. Transliteration means mapping of source language text into the target language. Simple mapping decreases the efficiency of overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming assisted transliteration. We have shown that much of the content in Gujarati gets transliterated while being processed for translation to Hindi language.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far; however, it remains unclear which method is the most effective in practice. This article focuses on evaluating and comparing the available feature selection methods for general versatility on authorship attribution problems, and tries to identify which method is the most effective. We discuss the general versatility of feature selection methods and their connection to selecting appropriate features for varying data. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes, and chi-square proved to be a better method overall.
Named Entity Recognition System for Hindi Language: A Hybrid Approach
International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011
Shilpi Srivastava shilpii26@gmail.com
Department of Computer Science
University of Mumbai, Vidyanagri, Santacruz (E)
Mumbai-400098, India
Mukund Sanglikar masanglikar@rediffmail.com
Professor, Department of Mathematics,
Mithibai college, Vile Parle (W), University of Mumbai
Mumbai-400056, India
D.C Kothari kothari@mu.ac.in
Professor, Department of Physics,
University of Mumbai, Vidyanagri, Santacruz(E)
Mumbai-400098, India
Abstract
Named Entity Recognition (NER) is a major early step in Natural Language
Processing (NLP) tasks like machine translation, text to speech synthesis,
natural language understanding etc. It seeks to classify words which represent
names in text into predefined categories like location, person-name, organization,
date, time etc. In this paper we have used a combination of machine learning and
Rule based approaches to classify named entities. The paper introduces a hybrid
approach for NER. We have experimented with Statistical approaches like
Conditional Random Fields (CRF) & Maximum Entropy (MaxEnt) and Rule based
approach based on a set of linguistic rules. The linguistic approach plays a vital role in
overcoming the limitations of statistical models for a morphologically rich language like Hindi.
The system also uses a voting method to improve the performance of the NER system.
Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid Approach
1. INTRODUCTION
Named Entity Recognition is a subtask of Information Extraction in which we locate and classify
proper names in text into predefined categories. NER is a precursor for many natural language
processing tasks. An accurate NER system is needed for machine translation, more accurate
internet search engines, automatic indexing of documents, automatic question-answering,
information retrieval etc.
Most NER systems use a rule based approach or statistical machine learning approach or a
combination of these. A Rule-based NER system uses hand-written rules to tag a corpus with
named entity (NE) tags. Machine-learning (ML) approaches are popularly used in NER because
these are easily trainable, adaptable to different domains and languages and their maintenance is
less expensive. A hybrid NER system is a combination of both rule-based and statistical
approaches.
Not much work has been done on NER for Indian languages like Hindi. Hindi is the third most
spoken language of the world and still no accurate Hindi NER system exists. As some features
like capitalization are not available in Hindi, and due to the lack of a large labeled dataset, lack of
standardization, and spelling variations, an English NER system cannot be used directly for Hindi.
There is a need to develop an accurate Hindi NER system for better presence of Hindi on the
internet. It is necessary to understand Hindi language structure and learn new features for
building better Hindi NER systems.
In this paper, we report a NER system for Hindi using three classifiers, namely MaxEnt, CRF and
a rule-based model. We demonstrate a comparative study of the performance of the two
statistical classifiers (MaxEnt & CRF) widely used in NLP tasks, and use a novel voting
mechanism based on classification confidence (which has statistical validity) to combine the two
classifiers along with preliminary handcrafted rules.
Our proposed system is an attempt to illustrate the hybrid approach for Hindi Named Entity
Recognition. The system makes use of POS information of the words along with a variety of
orthographic word-level features that are helpful in predicting the various NE classes.
Theoretically it is known that CRF is better than MaxEnt, due to the label bias problem of MaxEnt.
The main contribution of this work is a comparative study between the two classifiers MaxEnt and
CRF; results show that CRF always gave better results than MaxEnt.
In the following sections, we discuss previous work, the issues in the Hindi language and various
approaches to the NER task, and then describe our approach, design and implementation details,
results and concluding discussion.
2. RELATED WORKS
NER has drawn more and more attention from NLP researchers over the last decade (Chinchor
1995, Chinchor 1998) [5] [18]. Two generally classified approaches to NER are the Linguistic
approach and Machine learning (ML) based approach. The Linguistics approach uses rule-based
models manually written by linguists. ML based techniques make use of a large amount of
annotated training data to acquire high-level language knowledge. Various ML techniques which
are used for the NER task are Hidden Markov Model (HMM) [7], Maximum Entropy Model
(MaxEnt) [6], Decision Tree [3], Support Vector Machines [4] and Conditional Random Fields
(CRFs) [10]. Both approaches may make use of gazetteer information in building the system,
because it improves accuracy.
Ralph Grishman in 1995 developed a rule-based NER system which uses some specialized
name dictionaries, including names of all countries, names of major cities, names of companies,
common first names etc. [15]. Another rule-based NER system was developed in 1996, which
makes use of several gazetteers like organization names, location names, person names, human
titles etc. [16]. The main disadvantages of these rule-based techniques are that they require
extensive experience and grammatical knowledge of particular languages or domains, and that
such systems are not transferable to other languages.
Here we mention a few NER systems that have used ML techniques. ‘Identifinder’ is one of the
first-generation ML based NER systems, which used the Hidden Markov Model (HMM) [7]. Using
mainly capital letter and digit information, this system achieved an F-value of 87.6 on English.
Borthwick used MaxEnt in his NER system with lexical information, section information and
dictionary features [6]. He also showed that ML approaches can be combined with hand-coded
systems to achieve better performance, and he was able to develop a 92% accurate English
NER system. Mikheev et al. also developed a hybrid system containing statistical and hand-coded
components that achieved an F-value of 93.39 [17].
Other ML approaches like Support Vector Machine (SVM), Conditional Random Field (CRF), and
Maximum Entropy Markov Model (MEMM) are also used in developing NER systems.
Combinations of different ML approaches are also used; for example, a system developed by
Srihari et al. combined several modules built using MaxEnt, HMM and handcrafted rules, and
achieved an F-value of 93.5 [19].
The NER task for Hindi has been explored by Cucerzan and Yarowsky in their language
independent NER which used morphological and contextual evidences [20]. They ran their
experiments with 5 languages: Romanian, English, Greek, Turkish and Hindi. Among these, the
accuracy for Hindi was the worst. A recent Hindi NER system was developed by Li and McCallum
using CRF with feature induction [21]. They automatically discovered relevant features by
providing a large array of lexical tests and using feature induction to construct the features that
most increase the conditional likelihood. However, the performance of these systems is
significantly hampered when the test corpus is not similar to the training corpus. A few studies
(Guo et al., 2009; Poibeau and Kosseim, 2001) have been performed on genre/domain
adaptation, but this still remains an open area. The IJCNLP-08 workshop on NER for South and
South East Asian languages, held in 2008 at IIIT Hyderabad, was a major attempt at introducing
NER for Indian languages; it concentrated on five Indian languages: Hindi, Bengali, Oriya,
Telugu and Urdu. As part of this shared task, [22] reported a CRF-based system followed by
post-processing using some heuristics or rules. Some other efforts for Indian languages have
also been made [23] [24]. A CRF-based system has been reported in [25], where it has been
shown that a hybrid CRF-based model can perform better than CRF alone. [26] presents a hybrid
approach for identifying Hindi names, using knowledge infusion from multiple sources of
evidence.
To the best of the authors' knowledge, no prior work demonstrates a comparative study between
the two classifiers MaxEnt and CRF and uses a hybrid model based on MaxEnt, CRF and a rule
base for Hindi Named Entity Recognition.
3. ISSUES WITH HINDI LANGUAGE
The task of building a named entity recognizer for the Hindi language presents several issues
related to its linguistic characteristics. Some issues faced by Hindi and other Indian languages
are:
• No capitalization: Unlike English and most European languages, Indian languages
lack the capitalization information that plays a very important role in identifying NEs in those
languages. English NER systems can exploit capitalization to their advantage because
English names always start with capital letters, whereas the Hindi script offers no
graphical cue like capitalization that could act as an important indicator for NER.
• Ambiguous names: Hindi names are ambiguous, and this makes recognition a
very difficult task. One feature of named entities in Hindi is the high overlap
between common nouns and proper nouns. Indian person names are more
diverse compared to those of most other languages, and a lot of them can be found in the
dictionary as common nouns.
• Scarcity of resources and tools: Hindi, like other Indian languages, is a resource-poor
language. Annotated corpora, name dictionaries, good morphological analyzers,
POS taggers etc. are not yet available in the required quantity and quality.
• Lack of standardization and spelling: Another important language related issue is the
variation in the spellings of proper names. This increases the number of tokens to be
learnt by the machine and would perhaps also require a higher-level task like co-occurrence
resolution.
• Free word order language: Indian languages have relatively free word order.
• Web sources for name lists are available in English, but such lists are not available in
Indian languages.
• Although Indian languages have a very old and rich literary history, technology
development for them is recent.
• Indian languages are highly inflected and provide rich and challenging sets of linguistic
and statistical features resulting in long and complex word forms.
• Lack of labeled data.
• Non-availability of large gazetteers.
4. VARIOUS APPROACHES FOR NER
There are three basic approaches to NER [1]: the rule-based approach, the statistical or
machine learning approach, and the hybrid approach.
4.1 Rule Based Approach
It uses linguistic grammar-based techniques to find named entity (NE) tags. It needs rich and
expressive rules and gives good results. It requires great knowledge of grammar and other
language related rules. Good experience is needed to come up with good rules and heuristics. It
is not easily portable and has high acquisition cost. It is very specific to the target data.
4.2 Statistical Methods or Machine Learning Methods
The common machine learning models used for NER are:
• HMM [14]: HMM stands for Hidden Markov Model. HMM is a generative model: it
assigns a joint probability to paired observation and label sequences, and the
parameters are trained to maximize the joint likelihood of the training set.
It is advantageous as its basic theory is elegant and easy to understand, so it is
easy to implement and analyze. It uses only positive data, so it can be easily scaled.
It has a few disadvantages. In order to define a joint probability over observation and label
sequences, HMM needs to enumerate all possible observation sequences. Hence it makes
various assumptions about the data, like the Markov assumption, i.e. the current label depends
only on the previous label. It is also not practical to represent multiple overlapping
features and long-term dependencies. The number of parameters to be estimated is huge, so
it needs a large data set for training.
• MaxEnt [6]: MaxEnt stands for Maximum Entropy Markov Model (MEMM). It is a
conditional probabilistic sequence model. It can represent multiple features of a word and
can also handle long term dependency. It is based on the principle of maximum entropy
which states that the least biased model which considers all known facts is the one that
maximizes entropy. Each source state has an exponential model that takes the
observation features as input and outputs a distribution over possible next states. Output
labels are associated with states.
It solves the problem of multiple feature representation and the long-term dependency issue
faced by HMM, and generally achieves higher recall and greater precision than HMM.
It also has some disadvantages, chiefly the label bias problem: the probability transitions
leaving any given state must sum to one, so the model is biased towards states with fewer
outgoing transitions; a state with a single outgoing transition effectively ignores its
observation. To handle the label bias problem, the state-transition structure can be changed;
CRFs avoid the problem through global normalization.
• CRF [10]: CRF stands for Conditional Random Field. It is a type of discriminative
probabilistic model that has all the advantages of MEMMs without the label bias problem.
CRFs are undirected graphical models (also known as random fields) used to
calculate the conditional probability of the values on designated output nodes given the
values assigned to the input nodes.
4.3 Hybrid Models
Hybrid models are basically combinations of rule-based and statistical models. A hybrid NER
system combines rule-based and ML techniques, building new methods from the strongest
points of each: it makes use of the essential features of ML approaches and uses rules to make
the system more efficient.
5. OUR APPROACH
5.1 CRF Based Machine Learning
The basic idea of CRF is to construct the conditional probability $P(Y|X)$ of a label sequence
$Y$ (e.g. NE tags) given an observation sequence $X$ (e.g. words); after the model is
constructed, testing is done by finding the label sequence that maximizes $P(Y|X)$ for the
observed features.
Definition [10]: "Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is
indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field if, when conditioned
on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph:
$P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that
$w$ and $v$ are neighbors in $G$."
“Lafferty et al. [10] define the probability of a particular label sequence $y$ given the observation
sequence $x$ to be a normalized product of potential functions, each of the form

$$\exp\Big( \sum_j \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k\, s_k(y_i, x, i) \Big)$$

where $t_j(y_{i-1}, y_i, x, i)$ is a transition feature function of the entire observation sequence
and of the labels at positions $i$ and $i-1$ in the label sequence; $s_k(y_i, x, i)$ is a state
feature function of the label at position $i$ and the observation sequence; and $\lambda_j$ and
$\mu_k$ are parameters to be estimated from training data.
The final expression for the probability of a label sequence $y$ given an observation sequence
$x$ is

$$p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_j \lambda_j\, f_j(y_{i-1}, y_i, x, i) \Big)$$

where each $f_j(y_{i-1}, y_i, x, i)$ is either a state function $s_j(y_{i-1}, y_i, x, i)$ or a transition
function $t_j(y_{i-1}, y_i, x, i)$.” [13]
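To make the formula concrete, here is a minimal, purely illustrative Python sketch (not the
Mallet-based implementation used in this work): it scores a two-word toy sequence with one
invented transition feature t1 and one invented state feature s1, and computes $P(y|x)$ by
brute-force normalization over all label sequences.

    import math
    from itertools import product

    LABELS = ["NEL", "none"]
    LAMBDA, MU = 0.8, 1.5  # invented weights (the lambda and mu above)

    def t1(y_prev, y, x, i):
        # transition feature: two consecutive location labels
        return 1.0 if (y_prev == "NEL" and y == "NEL") else 0.0

    def s1(y, x, i):
        # state feature: a "pur"-suffixed word labeled as a location
        return 1.0 if (y == "NEL" and x[i].endswith("pur")) else 0.0

    def score(y, x):
        # sum over positions of the weighted transition and state features
        total = 0.0
        for i in range(len(x)):
            if i > 0:
                total += LAMBDA * t1(y[i - 1], y[i], x, i)
            total += MU * s1(y[i], x, i)
        return total

    def p_y_given_x(y, x):
        # P(y|x) = exp(score(y,x)) / Z(x); Z(x) summed by brute force here
        z = sum(math.exp(score(y2, x)) for y2 in product(LABELS, repeat=len(x)))
        return math.exp(score(y, x)) / z

    x = ["raja", "jodhpur"]
    print(p_y_given_x(("none", "NEL"), x))

In a real CRF toolkit such as Mallet, the weights are estimated from training data and the
brute-force $Z(x)$ is replaced by dynamic programming.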
We are using mallet-0.4 [12] for training and testing. Mallet provides a SimpleTagger program
that takes as input a file in the mallet format of Figure 1. After training, the model is saved to a
file; the model file can then be used for testing. When the trained model is tested, it produces an
output file that
contains the predicted tags of the words. The predicted tags appear on the same line numbers
as the corresponding words in the input file.
FIGURE 1: Data in mallet format
5.2 MaxEnt Based Machine Learning
It is based on the principle of maximum entropy, which states that the least biased model which
considers all known facts is the one that maximizes entropy.
Let $H$ be the set of possible histories and $T$ the set of allowable tags.
The maximum entropy model is defined over $H \times T$.
The model's probability of a history $h$ together with a tag $t$ is
( , )
( , ) if h t
j
j
p h t πµ α= ∏
Where,
π is normalization constant
, jµ α are model parameters
( , )if h t feature function
Let $L(p)$ be the likelihood of the training data under distribution $p$:

$L(p) = \prod_{i=1}^{n} p(h_i, t_i)$
The training method chooses the model parameters that maximize $L(p)$, i.e. according to the maximum likelihood principle.
We use the mallet-0.4 MaxEnt implementation. For training and testing with MaxEnt, we created a MaxEntTagger program, similar to SimpleTagger, that converts an input file in the format of Figure 1 into Mallet's internal data structures. Training and testing then proceed as for CRF.
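As an illustration of this conversion step, the hypothetical sketch below (not the actual MaxEntTagger) splits each line of the Figure 1 format into the word, its feature list, and its NE tag; in our MaxEntTagger the resulting pieces are then loaded into Mallet's internal data structures:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Arrays;

    public class MalletFormatReader {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.trim().isEmpty()) continue;   // blank line = sentence break
                    String[] tok = line.trim().split("\\s+");
                    String word = tok[0];                  // surface form
                    String tag = tok[tok.length - 1];      // last column is the NE tag
                    String[] feats = Arrays.copyOfRange(tok, 1, tok.length - 1);
                    System.out.println(word + " " + Arrays.toString(feats) + " -> " + tag);
                }
            }
        }
    }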
5.3 Rule Based Model
The following rules were used to assign NE tags to words; an illustrative sketch follows the list.
• <ne=NEN>: For numbers written as Hindi words, like ek, paanch, etc., dictionary matching is used; the file containing Hindi number words is provided by Hindi WordNet [11]. If a token contains only digits, it is also tagged NEN.
• <ne=NEL>: Dictionary matching is used for common locations like Bharat (India) and Kanpur. Suffix matching is also used: words ending in "pur" are generally cities, e.g. Kanpur, Nagpur, Jodhpur.
• <ne=NEB>: Dictionary matching is used.
• <ne=NETI>: Regular-expression matching is used, e.g. the 12-3-2008 format is tagged NETI.
• <ne=NEP>: Suffix matching is used with common surnames like Sharma, Agrawal, Kumar.
• <ne=NED>: Prefix matching is used with common designations like doctor, raja, pradhanmantri.
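A simplified sketch of three of these rules is given below; the date pattern, city-suffix list, and surname list are illustrative stand-ins for the dictionaries described above, not the exact resources we used:

    import java.util.regex.Pattern;

    public class RuleTagger {
        // d-m-yyyy style dates are tagged <ne=NETI>.
        private static final Pattern DATE = Pattern.compile("\\d{1,2}-\\d{1,2}-\\d{4}");
        private static final String[] CITY_SUFFIXES = { "pur" };  // Kanpur, Nagpur, Jodhpur
        private static final String[] SURNAMES = { "sharma", "agrawal", "kumar" };

        public static String tag(String word) {
            String w = word.toLowerCase();
            if (DATE.matcher(w).matches()) return "<ne=NETI>";
            for (String s : CITY_SUFFIXES) if (w.endsWith(s)) return "<ne=NEL>";
            for (String s : SURNAMES) if (w.endsWith(s)) return "<ne=NEP>";
            return "none";
        }

        public static void main(String[] args) {
            System.out.println(tag("12-3-2008"));  // prints <ne=NETI>
            System.out.println(tag("Jodhpur"));    // prints <ne=NEL>
        }
    }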
5.4 Voting
In voting we combine the results of the CRF, MaxEnt, and rule-based models to obtain a better model. For each word, the weight of every NE tag (including "none") is initialized to 0. Whenever a model predicts a tag for the word, the weight of that tag is increased; the final answer is the tag with the highest weight.
Some heuristics are used to improve the accuracy of the model: the weight of NEM tags predicted by the rule-based model is kept high, as these predictions are generally correct, and if two models predict the same tag, that tag is the answer. A sketch of this scheme is given below.
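A minimal sketch of this weighted voting follows; the unit weights and the boosted weight of 2.0 for rule-based NEM predictions are illustrative assumptions, not the exact values used:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class Voter {
        public static String vote(String crfTag, String maxentTag, String ruleTag) {
            Map<String, Double> weight = new HashMap<>();
            weight.merge(crfTag, 1.0, Double::sum);
            weight.merge(maxentTag, 1.0, Double::sum);
            // Heuristic from above: rule-based NEM predictions are generally
            // correct, so they receive a higher weight.
            weight.merge(ruleTag, ruleTag.equals("<ne=NEM>") ? 2.0 : 1.0, Double::sum);
            // The tag with the highest accumulated weight wins; when two models
            // agree, their shared tag automatically accumulates the most weight.
            return Collections.max(weight.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
        }

        public static void main(String[] args) {
            System.out.println(vote("none", "<ne=NEP>", "<ne=NEP>"));  // prints <ne=NEP>
        }
    }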
6. DESIGN & IMPLEMENTATION
6.1 Data and Tools
• Dataset: the Named Entity Annotated Corpus for Hindi, obtained from the IJCNLP-08 website [8]. The SSF format [9] is used for representing the annotated corpus; the annotation was performed manually at IIIT Hyderabad.
• Dictionary Source: we used files containing common Hindi nouns, verbs, adjectives, and adverbs for part-of-speech (POS) tagging; the files were obtained from Hindi WordNet, IIT Mumbai [11].
• Tools: Mallet-0.4 [12] is used for training and testing the machine-learning models, CRF [10] and MaxEnt [6]. For CRF, Mallet provides a SimpleTagger program that takes as input a training file in which each word is followed by its features (noun, verb, number, etc.) and its Named Entity (NE) tag, and converts the file into the data structures CRF uses for training.
e.g. training file format:
Word feature_1 feature_2 ... feature_n NE_tag
ek noun adj number <ne=NEN>
adhik adj adv none
Here the word "ek" has 3 features, namely noun, adj, and number; its NE tag is <ne=NEN>. The second word "adhik" has 2 features, namely adj and adv, and its NE tag is none.
For testing, the file format is the same except that the NE tag at the end of each line is omitted, i.e. each line contains only a word followed by its features.
For MaxEnt, we created MaxEntTagger.java to process the input files and use them to train and test the MaxEnt model.
• Tagset Used: Table 1 [2] lists the Named Entity tagset used in the corpus.
• Programming Language & utility: Java, bash script, awk, grep
Tag Name Examples
<ne=NEP> Person Bob Dylan, Mohandas Gandhi
<ne=NED> Designation General Manager, Commissioner
<ne=NEO> Organization Municipal Corporation
<ne=NEA> Abbreviation NLP, B.J.P.
<ne=NEB> Brand Pepsi, Nike (ambiguous)
<ne=NETP> Title Person Mahatma, Dr., Mr.
<ne=NETO> Title Object Pride and Prejudice, Othello
<ne=NEL> Location New Delhi, Paris
<ne=NETI> Time 3rd September, 1991(ambiguous)
<ne=NEN> Number 3.14, 4,500
<ne=NEM> Measure Rs. 4,500, 5 kg
<ne=NETE> Terms Maximum Entropy, Archeology
None Not a named entity Rain, go, hai, ka, ke , ki
TABLE 1: The named entity tagset used for shared task
6.2 Design Schemes
• Editing Data: The first objective is to convert the annotated Hindi corpus, given in SSF format, into a format usable by the mallet-0.4 CRF and MaxEnt models for training and testing. The SSF format (e.g. Figure 2) contains many elements, such as line numbers, braces, and <Sentence id=""> markers, that are not present in the mallet format (e.g. Figure 3). In SSF the NE tags appear on separate lines, and they must be placed after the word in the mallet format. Also, words that represent an NE tag only when combined, like "narad muni" in Figure 2, need to be concatenated. After writing each word on its own line with its NE tag, we need to find features for each word.
• Features: We used mostly orthographic features, as other researchers have done; a feature-extraction sketch is given at the end of this section. The features of a word include:
• Symbol: the word is a symbol like "?", ",", ";", "."
• Noun: the word is a noun
• Adj: the word is an adjective
• Adv: the word is an adverb
• Verb: the word is a verb
• First Word: the word is the first word of a sentence
• Number: the word is a number like ek, paanch, or 123
• Num Start: the word starts with a number, like 123_kg
Features are added by rule-based matching (e.g. for numbers) and by dictionary matching against the word lists obtained from Hindi WordNet, IIT Mumbai [11] (e.g. for nouns and verbs).
• Training and Testing on Mallet: The model is trained on 10, 50, 100, and 150 training files respectively, and each trained model is then tested on 10 files on which it was not trained. The training and testing files are drawn randomly from the dataset, and the whole process is repeated 10 times. The average and best results of these tests are reported in the Results section. This is done for both the CRF and MaxEnt models on the given data.
FIGURE 2: Data in SSF format
FIGURE 3: Data in mallet format after conversion from SSF
• Test Dataset Using Rule-Based Model: all test datasets are run through the rule-based model.
• Improve Accuracy by Voting: The output of each of the above methods (CRF, MaxEnt, rule-based) is a file containing the predicted tag for each word, on the same line as the word. The voting algorithm takes the results of the trained CRF and MaxEnt models and of the rule-based model and combines them: voting is performed over the three predictions, and the tag with the greatest weight is the final tag.
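The sketch below illustrates the feature extraction described above; the dictionary sets stand in for the Hindi WordNet word lists, the regular expressions are simplifying assumptions, and only a subset of the features is shown:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class FeatureExtractor {
        // Returns the orthographic features of a (non-empty) token.
        public static List<String> features(String word, boolean firstInSentence,
                                            Set<String> nouns, Set<String> numberWords) {
            List<String> f = new ArrayList<>();
            if (word.matches("[?,;.]")) f.add("symbol");            // punctuation symbol
            if (nouns.contains(word)) f.add("noun");                // dictionary match
            if (firstInSentence) f.add("firstword");
            if (numberWords.contains(word) || word.matches("\\d+"))
                f.add("number");                                    // ek, paanch, 123
            if (Character.isDigit(word.charAt(0)) && !word.matches("\\d+"))
                f.add("numstart");                                  // e.g. 123_kg
            return f;
        }
    }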
7. RESULTS
7.1 Performance Evaluation Metric
The evaluation measures for the datasets are precision, recall, and F-measure.
• Precision (P): Precision is the fraction of the answers produced that are correct:
$\text{Precision}(P) = \dfrac{\text{correct answers}}{\text{answers produced}}$
• Recall (R): Recall is the fraction of the total possible correct answers that are successfully produced:
$\text{Recall}(R) = \dfrac{\text{correct answers}}{\text{total possible correct answers}}$
• F-Measure: The weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is

$F\text{-measure} = \dfrac{(\beta^2 + 1)\,P\,R}{\beta^2 R + P}$

where $\beta$ sets the weighting between precision and recall; typically $\beta = 1$. When recall and precision are evenly weighted, i.e. $\beta = 1$, the F-measure is called the F1-measure:

$F_1\text{-measure} = \dfrac{2\,P\,R}{P + R}$
There is a tradeoff between precision and recall in the performance metric.
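These definitions translate directly into code. The tiny helper below is a sketch using the count names from the formulas above:

    public class Metrics {
        public static double precision(int correct, int produced) {
            return (double) correct / produced;
        }
        public static double recall(int correct, int possible) {
            return (double) correct / possible;
        }
        // Balanced F-score (beta = 1): the harmonic mean of precision and recall.
        public static double f1(double p, double r) {
            return 2 * p * r / (p + r);
        }
    }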
7.2 Results Obtained
• CRF Results: Table 2 contains the results obtained from testing the CRF models. A model is trained on 10, 50, 100, or 150 files and then tested on 10 files it has not seen (the model trained on 10 files is tested on 5 files). This is repeated for 10 rounds: for example, for the model trained on 100 files, 110 files are selected from the dataset, the model is trained on 100 of them and tested on the remaining 10; then another 110 files are chosen and training and testing are repeated, 10 times in all.
Training files    Testing files    Precision    Recall    F-1 Measure
10                5                71.43        30.86     43.10
50                10               83.87        25.74     39.40
100               10               88.24        24.19     37.97
150               10               88.89        24.61     38.55
TABLE 2: CRF results for the single best predicted tag
In the above experiment only the single best predicted tag of a word is considered. Since NE tags are rare compared to the "none" tag, the model mostly learns the "none" tag. We therefore also evaluated using the best two predicted tags of a word; a sketch of the selection rule is given below. Here the model outputs its two best tags, which may be the same or different. If the first tag is an NE tag, it is taken as the answer; if the first tag is "none" and the second is an NE tag, the second is taken. This experiment is conducted in the same manner as the one above.
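The selection rule for this best-of-two evaluation can be written as a one-method sketch:

    public class BestOfTwo {
        // Prefer the top prediction unless it is "none" and the second-best
        // prediction is a real NE tag.
        public static String pick(String first, String second) {
            return first.equals("none") && !second.equals("none") ? second : first;
        }
    }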
The results for CRF when the best two predicted tags are taken into consideration are shown in Table 3:
11. Shilpi Srivastava, Mukund Sanglikar & D.C Kothari
International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011 20
Training files    Testing files    Precision    Recall    F-1 Measure
10                5                70.0         34.57     46.28
50                10               89.28        49.5      63.69
100               10               83.33        33.9      48.19
150               10               74.28        33.37     46.43
TABLE 3: CRF results for the best of two predicted tags
• MaxEnt Results: the following tables contain the results of training and testing the MaxEnt model. The model is trained on randomly chosen sets of 10, 50, 100, or 150 files and then tested on 10 files on which it was not trained. Each training-and-testing run is repeated for ten rounds on different random splits, as above. The results are shown in Table 4:
Training files    Testing files    Precision    Recall    F-1 Measure
10                5                76.92        19.8      31.49
50                10               70.40        16.68     26.39
100               10               69.21        18.14     28.19
150               10               69.46        16.57     26.06
TABLE 4: MaxEnt results for the single best predicted tag
MaxEnt results when the best two predicted tags are taken into consideration, obtained in the same way as in the CRF experiment, are given in Table 5.
Training files    Testing files    Precision    Recall    F-1 Measure
10                5                90.47        29.23     44.18
50                10               89.28        21.36     34.48
100               10               87.5         22.58     35.89
150               10               96.15        25.25     39.99
TABLE 5: MaxEnt results for the best of two predicted tags
• Rule Based Results: results obtained from the rule-based model are given in Table 6:
Testing files    Precision    Recall    F-1 Measure
1                65.93        77.92     71.43
2                88.0         60.27     71.54
3                96.05        86.90     91.25
TABLE 6: Rule-based model's test results
• Voting Algorithm: for voting we used three classifiers: CRF trained on 50 files, MaxEnt trained on 50 files, and the rule-based model. Results from the voting algorithm are given in Table 7:
Testing files    Precision    Recall    F-1 Measure
40               81.11        84.88     82.95
40               85.51        76.62     80.82
TABLE 7: Voting algorithm's results
8. CONCLUSION
This paper presents a comparative study of different approaches, MaxEnt, CRF, and rule-based, using POS and orthographic features, and shows that a voting mechanism gives better results. On average, CRF gives better results than MaxEnt, while the rule-based approach has better recall and F-1 measure; on the given data the average precision is good. The main reason for the lower F-1 measure of CRF and MaxEnt is the scarcity of NE tags in the original data compared to "none": for most files, NE tags make up less than 2% of the words in a file, so the classifier learns "none" far more strongly than the NE tags. The data also contains tagging errors, e.g. "Gandhi" is variously tagged <ne=NEN>, <ne=NEP>, <ne=NED>, or "none" across files, and "ek" is tagged <ne=NEN> or "none". These conflicting cases in the training set weaken the classifier, which is why more training data does not give better results here. The classifiers do give good precision, i.e. fewer words are tagged, but those that are tagged are tagged correctly.
When we take the best two predicted tags for the analysis, F-1 measure and recall increase significantly. Since the data contains very few NE tags and is not very accurate, most words are learned as "none"; considering the best two predicted tags therefore improves the results significantly. The rule-based model gives better average results (F-1 measure, recall) on the given data, and the voting algorithm improves the F-1 measure further.
9. FUTURE WORK
Dictionary matching of words is not very effective. In this experiment we used orthographic features, as other researchers have; however, a POS tagger or morphological analyzer, semantic tags, parasarg (preposition and postposition) identification, a lexicon database, and co-occurrence information may give better results. Boosting may be attempted using a context window of the 5 words before and the 5 words after each NE tag. Conflicting tags can be removed, or another dataset can be tried. More features can be added to improve the models, and the rule-based model itself can be refined. We may also experiment with other classifiers such as HMM.
10. ACKNOWLEDGMENT
I would like to thank Mr. Pankaj Srivastava, Ms. Agrima Srivastava, and Ms. Vertika Khanna, who provided helpful analysis during model development.
11. REFERENCES
[1] Sudeshna Sarkar, Sujan Saha and Prthasarthi Ghosh, "Named Entity Recognition for Hindi",
In Microsoft Research India Summer School talk, p. 21-30, May 2007.
[2] Anil Kumar Singh, "Named Entity Recognition for South and South East Asian Languages:
Taking Stock", p. 5-7, In IJCNLP 2008.
[3] Hideki Isozaki. 2001. “Japanese named entity recognition based on a simple rule generator
and decision tree learning” in the proceedings of the Association for Computational
Linguistics, pages 306-313. India.
[4] Takeuchi K. and Collier N. 2002. “Use of Support Vector Machines in extended named entity
recognition” in the proceedings of the sixth Conference on Natural Language Learning
(CoNLL-2002), Taipei, Taiwan, China.
[5] Charles L. Wayne. 1991. "A snapshot of two DARPA speech and Natural Language Programs" in the proceedings of the workshop on Speech and Natural Languages, pages 103-404, Pacific Grove, California. Association for Computational Linguistics.
[6] A. Borthwick, "A Maximum Entropy Approach to Named Entity Recognition", PhD thesis, New York University, p. 1-4, 18-24, September 1999.
[7] Daniel M. Bikel, Scott Miller, Richard Schwartz and Ralph Weischedel. 1997 “Nymble: a high
performance learning name-finder” in the proceedings of the fifth conference on Applied
natural language processing, pages 194-201, San Francisco, CA, USA Morgan Kaufmann
Publishers Inc.
[8] IJCNLP-08 Workshop data set, Source: http://ltrc.iiit.net/ner-ssea-08/index.cgi?topic=5
[9] Akshar Bharti, Rajeev Sangal and Dipti M Sharma, "Shakti Analyzer: SSF Representation",
IIIT Hyderabad, p. 3-5, 2006
[10] Lafferty, J., McCallum, A., Pereira, F., "Conditional random fields: Probabilistic models for
segmenting and labeling sequence data", In: Proc. 18th International Conf. on Machine
Learning, Morgan Kaufmann, San Francisco, p. 1-5, 2001
[11] Hindi Wordnet, Source: http://www.cfilt.iitb.ac.in/wordnet/webhwn/
[12] McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit."
http://mallet.cs.umass.edu. 2002.
[13] Hanna M. Wallach, "Conditional Random Fields: An Introduction”, Technical Report,
University of Pennsylvania. 4-5, 2004.
[14] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", In Proceedings of the IEEE, 77 (2), p. 257-286, February 1989.
[15] R. Grishman. 1995. “The NYU system for MUC-6 or Where’s the Syntax” in the proceedings
of Sixth Message Understanding Conference (MUC-6) , pages 167-195, Fairfax, Virginia.
[16] Wakao T., Gaizauskas R. and Wilks Y. 1996. “Evaluation of an algorithm for the Recognition
and Classification of Proper Names”, in the proceedings of COLING-96.
[17] Mikheev A, Grover C. and Moens M. 1998. Description of the LTG system used for MUC-7.
In Proceedings of the Seventh Message Understanding Conference.
[18] R. Grishman, Beth Sundheim. 1996. “Message Understanding Conference-6: A Brief
History” in the proceedings of the 16th International Conference on Computational
Linguistics (COLING), pages 466-471, Center for Sprogteknologi, Copenhagen, Denmark.
[19] Srihari R., Niu C. and Li W. 2000. A Hybrid Approach for Named Entity and Sub-Type
Tagging. In: Proceedings of the sixth conference on applied natural language processing.
[20] Cucerzan S. and Yarowsky D. 1999. Language independent named entity recognition
combining morphological and contextual evidence. In: Proceedings of the Joint SIGDAT
Conference on EMNLP and VLC 1999, pp. 90-99.
[21] Li W. and McCallum A. 2003. Rapid Development of Hindi Named Entity Recognition using
Conditional Random Fields and Feature Induction. In: ACM Transactions on Asian
Language Information Processing (TALIP), 2(3): 290–294.
[22] Gali, K., Sharma, H., Vaidya, A., Shisthla, P., Sharma, D.M.: Aggregating Machine Learning and Rule-based Heuristics for Named Entity Recognition. In: Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages. (2008) 25–32
[23] Asif Ekbal et al. "Language Independent Named Entity Recognition in Indian Languages". IJCNLP, 2008.
[24] Prasad Pingli et al. “A Hybrid Approach for Named Entity Recognition in Indian Languages”.
IJCNLP, 2008.
[25] Shilpi Srivastava, Siby Abraham, Mukund Sanglikar: “Hybrid Approach for Recognizing Hindi
Named Entity”, Proceedings of the International Conference on Managing Next Generation
Software Applications - 2008 (MNGSA 2008), Coimbatore, India, 5th- 6th December 2008.
[26] Shilpi Srivastava, Siby Abraham, Mukund Sanglikar, D C Kothari: “Role of Ensemble
Learning in Identifying Hindi Names”, International Journal of Computer Science and
Applications, ISSN No. 0974-0767.