Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and are classified into predefined categories such as Person names, Organization names , Location names, miscellaneous(Date and others). It is a key technology of Information Extraction, Question Answering system, Machine Translations, Information Retrial etc. This paper reports about the development of a NER system for Telugu using Conditional Random field (CRF). Though this state of the art machine learning technique has been widely applied to NER in several well-studied languages, the use of this technique to Telugu languages is very new. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the four different named entities (NE) classes, such as Person name, Location name, Organization name, miscellaneous (Date and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name, organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific
Abstract
Part of speech tagging plays an important role in developing natural language processing software. Part of speech tagging means assigning part of speech tag to each word of the sentence. The part of speech tagger takes a sentence as input and it assigns respective/appropriate part of speech tag to each word of that sentence. In this article I surveys the different work have done about odia POS tagging.
________________________________________________
Part of Speech tagging in Indian Languages is still an open problem. We still lack a clear approach in implementing a POS tagger for Indian Languages. In this paper we describe our efforts to build a Hidden Markov Model based Part of Speech Tagger. We have used IL POS tag set for the development of this tagger. We have achieved the accuracy of 92%
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag mentions the word’s
usage in the sentence. Usually, these tags indicate syntactic classification like noun or verb, and sometimes
include additional information, with case markers (number, gender etc) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches usually followed in Parts of Speech Tagging. Those are Rule based
Approach and Stochastic Approach. Rule based Approach use predefined handwritten rules. This is the
oldest approach and it use lexicon or dictionary for reference. Stochastic Approach use probabilistic and
statistical information to assign tag to words. It use large corpus, so that Time complexity and Space
complexity is high whereas Rule base approach has less complexity for both Time and Space. Stochastic
Approach is the widely used one nowadays because of its accuracy.
Malayalam is a Dravidian family of languages, inflectional with suffixes with the root word forms. The
currently used Algorithms are efficient Machine Learning Algorithms but these are not built for
Malayalam. So it affects the accuracy of the result of Malayalam POS Tagging.
My proposed Approach use Dictionary entries along with adjacent tag information. This algorithm use
Multithreaded Technology. Here tagging done with the probability of the occurrence of the sentence
structure along with the dictionary entry.
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of user’s query. This helps users to express the information need in their native languages. Machine translation based (MTbased)
approach of CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done in CLIR and MT systems for Marathi language in India.
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and are classified into predefined categories such as Person names, Organization names , Location names, miscellaneous(Date and others). It is a key technology of Information Extraction, Question Answering system, Machine Translations, Information Retrial etc. This paper reports about the development of a NER system for Telugu using Conditional Random field (CRF). Though this state of the art machine learning technique has been widely applied to NER in several well-studied languages, the use of this technique to Telugu languages is very new. The system makes use of the different contextual information of the words along with the variety of features that are helpful in predicting the four different named entities (NE) classes, such as Person name, Location name, Organization name, miscellaneous (Date and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name, organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific
Abstract
Part of speech tagging plays an important role in developing natural language processing software. Part of speech tagging means assigning part of speech tag to each word of the sentence. The part of speech tagger takes a sentence as input and it assigns respective/appropriate part of speech tag to each word of that sentence. In this article I surveys the different work have done about odia POS tagging.
________________________________________________
Part of Speech tagging in Indian Languages is still an open problem. We still lack a clear approach in implementing a POS tagger for Indian Languages. In this paper we describe our efforts to build a Hidden Markov Model based Part of Speech Tagger. We have used IL POS tag set for the development of this tagger. We have achieved the accuracy of 92%
A New Approach to Parts of Speech Tagging in Malayalamijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag mentions the word’s
usage in the sentence. Usually, these tags indicate syntactic classification like noun or verb, and sometimes
include additional information, with case markers (number, gender etc) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are mainly two approaches usually followed in Parts of Speech Tagging. Those are Rule based
Approach and Stochastic Approach. Rule based Approach use predefined handwritten rules. This is the
oldest approach and it use lexicon or dictionary for reference. Stochastic Approach use probabilistic and
statistical information to assign tag to words. It use large corpus, so that Time complexity and Space
complexity is high whereas Rule base approach has less complexity for both Time and Space. Stochastic
Approach is the widely used one nowadays because of its accuracy.
Malayalam is a Dravidian family of languages, inflectional with suffixes with the root word forms. The
currently used Algorithms are efficient Machine Learning Algorithms but these are not built for
Malayalam. So it affects the accuracy of the result of Malayalam POS Tagging.
My proposed Approach use Dictionary entries along with adjacent tag information. This algorithm use
Multithreaded Technology. Here tagging done with the probability of the occurrence of the sentence
structure along with the dictionary entry.
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
Cross Language Information Retrieval (CLIR) deals with retrieving relevant information stored in a language different from
the language of user’s query. This helps users to express the information need in their native languages. Machine translation based (MTbased)
approach of CLIR uses existing machine translation techniques to provide automatic translation of queries. This paper covers the
research work done in CLIR and MT systems for Marathi language in India.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is a process of identification of proper nouns in the text and classification of those nouns into certain predefined categories like person name, location,organization, date, and time etc. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC, based on Multinomial Naïve Bayes (MNB) Classifier. The Methodology adopted in this paper is based on feature extraction of training corpus, by using term frequency, inverse document frequency and fitting them to a tf-idf-vectorizer. The paper discusses the
various issues in developing the proposed model. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of size 95,170 tokens and test corpus of 5,000 tokens. It is observed that the model works with Precision, Recall and F1-measure of
83%, 79% and 81% respectively.
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGEijnlc
Named Entity Recognition (NER) for Myanmar Language is essential to Myanmar natural language processing research work. In this work, NER for Myanmar language is treated as a sequence tagging problem and the effectiveness of deep neural networks on NER for Myanmar language has been investigated. Experiments are performed by applying deep neural network architectures on syllable level Myanmar contexts. Very first manually annotated NER corpus for Myanmar language is also constructed and proposed. In developing our in-house NER corpus, sentences from online news website and also sentences supported from ALT-Parallel-Corpus are also used. This ALT corpus is one part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on NER task for Myanmar language. The experimental results show that those neural sequence models can produce promising results compared to the baseline CRF model. Among those neural architectures, bidirectional LSTM network added CRF layer above gives the highest F-score value. This work also aims to discover the effectiveness of neural network approaches to Myanmar textual processing as well as to promote further researches on this understudied language.
Named Entity Recognition System for Hindi Language: A Hybrid ApproachWaqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text to speech synthesis, natural language understanding etc. It seeks to classify words which represent names in text into predefined categories like location, person-name, organization, date, time etc. In this paper we have used a combination of machine learning and Rule based approaches to classify named entities. The paper introduces a hybrid approach for NER. We have experimented with Statistical approaches like Conditional Random Fields (CRF) & Maximum Entropy (MaxEnt) and Rule based approach based on the set of linguistic rules. Linguistic approach plays a vital role in overcoming the limitations of statistical models for morphologically rich language like Hindi. Also the system uses voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
Quality estimation of machine translation outputs through stemmingijcsa
Machine Translation is the challenging problem for Indian languages. Every day we can see some machine
translators being developed , but getting a high quality automatic translation is still a very distant dream .
The correct translated sentence for Hindi language is rarely found. In this paper, we are emphasizing on
English-Hindi language pair, so in order to preserve the correct MT output we present a ranking system,
which employs some machine learning techniques and morphological features. In ranking no human
intervention is required. We have also validated our results by comparing it with human ranking.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amount of information is lying dormant in historical documents and manuscripts. This information would go futile if not stored in digital form. Searching some relevant information from these scanned images would ideally require converting these document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
A study on the approaches of developing a named entity recognition tooleSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify arguments of a verb in a natural language sentence for Hindi and Urdu. For e.g. in the natural language sentence “Sara won the competition because of her hard work.”, ‘won’ is the main verb and there are 3 arguments for this verb; ‘Sara’ (Agent), ‘hard work’ (Reason) and ‘competition’ (Theme). The problem statement of a SRL revolves around the fact that how will you make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are 2 sub problem statements here (Identification and Classification), our SRL has a pipeline architecture in which a binary classifier (Logistic Regression) is first trained to identify whether a word is an argument to a verb in a sentence or not (Yes or No) and subsequently a multi-class classifier (SVM with Linear kernel) is trained to classify the identified arguments by above binary classifier into one of the 20 classes. These 20 classes are the various notions present in a natural language sentence (for e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause etc.). These ‘notions’ are called Propbank labels or semantic labels present in a Proposition Bank which is a collection of hand-annotated sentences.
In essence, SRL felicitates Semantic Parsing which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN etc. in a natural language sentence.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
A comparative study on term weighting methods for automated telugu text categ...IJDKP
Automatic Text categorization refers to the process of assigning a category or some categories
automatically among predefined ones. Text categorization is challenging in Indian languages has rich in
morphology, a large number of word forms and large feature spaces. This paper investigates the
performance of different classification approaches using different term weighting approaches in order to
decide the most applicable one to Telugu text classification problem. We have investigated on different
term weighting methods for Telugu corpus in combination with Naive Bayes ( NB), Support Vector
Machine (SVM) and k Nearest Neighbor (kNN) classifiers.
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
Named entity recognition (NER) is one of the applications of Natural Language Processing and is regarded as the subtask of information retrieval. NER is the process to detect Named Entities (NEs) in a document and to categorize them into certain Named entity classes such as the name of organization, person, location, sport, river, city, country, quantity etc. In English, we have accomplished lot of work related to NER. But, at present, still we have not been able to achieve much of the success pertaining to NER in the Indian languages. The following paper discusses about NER, the various approaches of NER, Performance Metrics, the challenges in NER in the Indian languages and finally some of the results that have been achieved by performing NER in Hindi by aggregating approaches such as Rule based heuristics and Hidden Markov Model (HMM).
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch
of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis,
natural language understanding, Information Extraction, Information retrieval, question answering etc.
The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based
approach of machine learning in detail to identify the named entities. The main idea behind the use of
HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can
use it according to their interest. The corpus used by our NER system is also not domain specific.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific.
Myanmar Named Entity Recognition with Hidden Markov Modelijtsrd
Named Entity Recognition is the process to detect Named Entities NEs in a file , document or from a corpus and to categorize them into certain Named entity classes like name of city, State, Country, organization, person, location, sport, river, quantity etc. This paper introduces the Named Entities Recognition NER for Myanmar language using Hidden Markov Model HMM .The main idea behind the use of HMM language independent and we can apply this system for any language domain. The corpus used by our NER system is also not domain specific. Khin Khin Lay | Aung Cho ""Myanmar Named Entity Recognition with Hidden Markov Model"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd24012.pdf
Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/24012/myanmar-named-entity-recognition-with-hidden-markov-model/khin-khin-lay
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
Text analysis has been attracting increasing attention in this data era. Selecting effective features from
datasets is a particular important part in text classification studies. Feature selection excludes irrelevant
features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy
and performance of identification. So far, so many feature selection methods have been proposed, however,
it remains unclear which method is the most effective in practice. This article focuses on evaluating and
comparing the available feature selection methods in general versatility regarding authorship attribution
problems and tries to identify which method is the most effective. The discussions on general versatility of
feature selection methods and its connection in selecting the appropriate features for varying data were
done. In addition, different languages, different types of features, different systems for calculating the
accuracy of SVM (support vector machine), and different criteria for determining the rank of feature
selection methods were used to measure the general versatility of these methods together. The analysis
results indicate the best feature selection method is different for each dataset; however, some methods can
always extract useful information to discriminate the classes. The chi-square was proved to be a better
method overall.
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...IJERA Editor
This research paper reports preliminary results of data-driven modeling of segmentalphoneme duration for
Marathi. Classification and Regression Tree based data driven duration modeling for segmental duration
prediction is presented. A number of features are considered and their usefulness and relative contribution for
segmental duration prediction is assessed. Objective evaluation of the duration model, by root mean squared
prediction error and correlation between actual and predicted durations, is performed.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
This article focuses on evaluating and comparing the available feature selection methods in general versatility regarding authorship attribution problems and tries to identify which method is the most effective. The discussions on general versatility of feature selection methods and its connection in selecting the appropriate features for varying data were done. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis results indicate the best feature selection method is different for each dataset; however, some methods can always extract useful information to discriminate the classes. The chi-square was proved to be a better method overall.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is a process of identification of proper nouns in the text and classification of those nouns into certain predefined categories like person name, location,organization, date, and time etc. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC, based on Multinomial Naïve Bayes (MNB) Classifier. The Methodology adopted in this paper is based on feature extraction of training corpus, by using term frequency, inverse document frequency and fitting them to a tf-idf-vectorizer. The paper discusses the
various issues in developing the proposed model. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of size 95,170 tokens and test corpus of 5,000 tokens. It is observed that the model works with Precision, Recall and F1-measure of
83%, 79% and 81% respectively.
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGEijnlc
Named Entity Recognition (NER) for Myanmar Language is essential to Myanmar natural language processing research work. In this work, NER for Myanmar language is treated as a sequence tagging problem and the effectiveness of deep neural networks on NER for Myanmar language has been investigated. Experiments are performed by applying deep neural network architectures on syllable level Myanmar contexts. Very first manually annotated NER corpus for Myanmar language is also constructed and proposed. In developing our in-house NER corpus, sentences from online news website and also sentences supported from ALT-Parallel-Corpus are also used. This ALT corpus is one part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on NER task for Myanmar language. The experimental results show that those neural sequence models can produce promising results compared to the baseline CRF model. Among those neural architectures, bidirectional LSTM network added CRF layer above gives the highest F-score value. This work also aims to discover the effectiveness of neural network approaches to Myanmar textual processing as well as to promote further researches on this understudied language.
Named Entity Recognition System for Hindi Language: A Hybrid ApproachWaqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text to speech synthesis, natural language understanding etc. It seeks to classify words which represent names in text into predefined categories like location, person-name, organization, date, time etc. In this paper we have used a combination of machine learning and Rule based approaches to classify named entities. The paper introduces a hybrid approach for NER. We have experimented with Statistical approaches like Conditional Random Fields (CRF) & Maximum Entropy (MaxEnt) and Rule based approach based on the set of linguistic rules. Linguistic approach plays a vital role in overcoming the limitations of statistical models for morphologically rich language like Hindi. Also the system uses voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
Quality estimation of machine translation outputs through stemmingijcsa
Machine Translation is the challenging problem for Indian languages. Every day we can see some machine
translators being developed , but getting a high quality automatic translation is still a very distant dream .
The correct translated sentence for Hindi language is rarely found. In this paper, we are emphasizing on
English-Hindi language pair, so in order to preserve the correct MT output we present a ranking system,
which employs some machine learning techniques and morphological features. In ranking no human
intervention is required. We have also validated our results by comparing it with human ranking.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGESijnlc
Large amount of information is lying dormant in historical documents and manuscripts. This information would go futile if not stored in digital form. Searching some relevant information from these scanned images would ideally require converting these document images to text form by doing optical character
recognition (OCR). For indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternate approach using word spotting can be effective to access large collections of document images. We propose a word spotting
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
A study on the approaches of developing a named entity recognition tooleSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Author Credits - Maaz Anwar Nomani
Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify arguments of a verb in a natural language sentence for Hindi and Urdu. For e.g. in the natural language sentence “Sara won the competition because of her hard work.”, ‘won’ is the main verb and there are 3 arguments for this verb; ‘Sara’ (Agent), ‘hard work’ (Reason) and ‘competition’ (Theme). The problem statement of a SRL revolves around the fact that how will you make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are 2 sub problem statements here (Identification and Classification), our SRL has a pipeline architecture in which a binary classifier (Logistic Regression) is first trained to identify whether a word is an argument to a verb in a sentence or not (Yes or No) and subsequently a multi-class classifier (SVM with Linear kernel) is trained to classify the identified arguments by above binary classifier into one of the 20 classes. These 20 classes are the various notions present in a natural language sentence (for e.g. Agent, Theme, Location, Time, Purpose, Reason, Cause etc.). These ‘notions’ are called Propbank labels or semantic labels present in a Proposition Bank which is a collection of hand-annotated sentences.
In essence, SRL felicitates Semantic Parsing which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN etc. in a natural language sentence.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITIONkevig
The aim of Named Entity Recognition (NER) is to identify references of named entities in unstructured documents, and to classify them into pre-defined semantic categories. NER often aids from added background knowledge in the form of gazetteers. However using such a collection does not deal with name variants and cannot resolve ambiguities associated in identifying the entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts with identifying named entities with a small set of training data. Using the identified named entities, the word and the context features are used to define the pattern. This pattern of each named entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple value score enables the generation of the new patterns to identify the named entity categories. We have evaluated the proposed system for English language with the dataset of tagged (IEER) and untagged (CoNLL 2003) named entity corpus and for Tamil language with the documents from the FIRE corpus and yield an average f-measure of 75% for both the languages.
A comparative study on term weighting methods for automated telugu text categ...IJDKP
Automatic Text categorization refers to the process of assigning a category or some categories
automatically among predefined ones. Text categorization is challenging in Indian languages has rich in
morphology, a large number of word forms and large feature spaces. This paper investigates the
performance of different classification approaches using different term weighting approaches in order to
decide the most applicable one to Telugu text classification problem. We have investigated on different
term weighting methods for Telugu corpus in combination with Naive Bayes ( NB), Support Vector
Machine (SVM) and k Nearest Neighbor (kNN) classifiers.
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
Named entity recognition (NER) is one of the applications of Natural Language Processing and is regarded as the subtask of information retrieval. NER is the process to detect Named Entities (NEs) in a document and to categorize them into certain Named entity classes such as the name of organization, person, location, sport, river, city, country, quantity etc. In English, we have accomplished lot of work related to NER. But, at present, still we have not been able to achieve much of the success pertaining to NER in the Indian languages. The following paper discusses about NER, the various approaches of NER, Performance Metrics, the challenges in NER in the Indian languages and finally some of the results that have been achieved by performing NER in Hindi by aggregating approaches such as Rule based heuristics and Hidden Markov Model (HMM).
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch
of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis,
natural language understanding, Information Extraction, Information retrieval, question answering etc.
The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based
approach of machine learning in detail to identify the named entities. The main idea behind the use of
HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can
use it according to their interest. The corpus used by our NER system is also not domain specific.
Named Entity Recognition using Hidden Markov Model (HMM)kevig
Named Entity Recognition (NER) is the subtask of Natural Language Processing (NLP) which is the branch of artificial intelligence. It has many applications mainly in machine translation, text to speech synthesis, natural language understanding, Information Extraction, Information retrieval, question answering etc. The aim of NER is to classify words into some predefined categories like location name, person name,
organization name, date, time etc. In this paper we describe the Hidden Markov Model (HMM) based approach of machine learning in detail to identify the named entities. The main idea behind the use of HMM model for building NER system is that it is language independent and we can apply this system for
any language domain. In our NER system the states are not fixed means it is of dynamic in nature one can use it according to their interest. The corpus used by our NER system is also not domain specific.
Myanmar Named Entity Recognition with Hidden Markov Modelijtsrd
Named Entity Recognition is the process to detect Named Entities NEs in a file , document or from a corpus and to categorize them into certain Named entity classes like name of city, State, Country, organization, person, location, sport, river, quantity etc. This paper introduces the Named Entities Recognition NER for Myanmar language using Hidden Markov Model HMM .The main idea behind the use of HMM language independent and we can apply this system for any language domain. The corpus used by our NER system is also not domain specific. Khin Khin Lay | Aung Cho ""Myanmar Named Entity Recognition with Hidden Markov Model"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4 , June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd24012.pdf
Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/24012/myanmar-named-entity-recognition-with-hidden-markov-model/khin-khin-lay
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
Text analysis has been attracting increasing attention in this data era. Selecting effective features from
datasets is a particular important part in text classification studies. Feature selection excludes irrelevant
features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy
and performance of identification. So far, so many feature selection methods have been proposed, however,
it remains unclear which method is the most effective in practice. This article focuses on evaluating and
comparing the available feature selection methods in general versatility regarding authorship attribution
problems and tries to identify which method is the most effective. The discussions on general versatility of
feature selection methods and its connection in selecting the appropriate features for varying data were
done. In addition, different languages, different types of features, different systems for calculating the
accuracy of SVM (support vector machine), and different criteria for determining the rank of feature
selection methods were used to measure the general versatility of these methods together. The analysis
results indicate the best feature selection method is different for each dataset; however, some methods can
always extract useful information to discriminate the classes. The chi-square was proved to be a better
method overall.
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...IJERA Editor
This research paper reports preliminary results of data-driven modeling of segmentalphoneme duration for
Marathi. Classification and Regression Tree based data driven duration modeling for segmental duration
prediction is presented. A number of features are considered and their usefulness and relative contribution for
segmental duration prediction is assessed. Objective evaluation of the duration model, by root mean squared
prediction error and correlation between actual and predicted durations, is performed.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSkevig
This article focuses on evaluating and comparing the available feature selection methods in general versatility regarding authorship attribution problems and tries to identify which method is the most effective. The discussions on general versatility of feature selection methods and its connection in selecting the appropriate features for varying data were done. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis results indicate the best feature selection method is different for each dataset; however, some methods can always extract useful information to discriminate the classes. The chi-square was proved to be a better method overall.
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISONIJCSEA Journal
The massive grow of the modern information retrieval system (IRS), especially in natural languages
becomes more difficult. The search in Arabic languages, as natural language, is not good enough yet. This
paper will try to build similar thesaurus based on Arabic language in two mechanisms, the first one is full
word mechanisms and the other is stemmed mechanisms, and then to compare between them.
The comparison made by this study proves that the similar thesaurus using stemmed mechanisms get more
better results than using traditional in the same mechanisms and similar thesaurus improved more the
recall and precision than traditional information retrieval system at recall and precision levels.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODSijnlc
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particular important part in text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. So far, so many feature selection methods have been proposed, however,
it remains unclear which method is the most effective in practice. This article focuses on evaluating and comparing the available feature selection methods in general versatility regarding authorship attribution problems and tries to identify which method is the most effective. The discussions on general versatility of feature selection methods and its connection in selecting the appropriate features for varying data were
done. In addition, different languages, different types of features, different systems for calculating the accuracy of SVM (support vector machine), and different criteria for determining the rank of feature selection methods were used to measure the general versatility of these methods together. The analysis
results indicate the best feature selection method is different for each dataset; however, some methods can always extract useful information to discriminate the classes. The chi-square was proved to be a better method overall.
In this paper, we made a survey on Word Sense Disambiguation (WSD). Near about in all major languages
around the world, research in WSD has been conducted upto different extents. In this paper, we have gone
through a survey regarding the different approaches adopted in different research works, the State of the
Art in the performance in this domain, recent works in different Indian languages and finally a survey in
Bengali language. We have made a survey on different competitions in this field and the bench mark
results, obtained from those competitions.
Text Mining is the technique that helps users to find out useful information from a large amount of text documents on the web or database. Most popular text mining and classification methods have adopted term-based approaches. The term based approaches and the pattern-based method describing user preferences. This review paper analyse how the text mining work on the three level i.e sentence level, document level and feature level. In this paper we review the related work which is previously done. This paper also demonstrated that what are the problems arise while doing text mining done at the feature level. This paper presents the technique to text mining for the compound sentences.
Arabic text categorization algorithm using vector evaluation methodijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
Domain Specific Named Entity Recognition Using Supervised ApproachWaqas Tariq
This paper introduces Named Entity Recognition approach for textual corpus. Supervised Statistical methods are used to develop our system. Our system can be used to categorize NEs belonging to a particular domain for which it is being trained. As Named Entities appears in text surrounded by contexts (words that are left or right of the NE), we will be focusing on extracting NE contexts from text and then perform statistical computing on them. We are using n-gram modeling for extracting contexts from text. Our methodology first extracts left and right tri-grams surrounding NE instances in the training corpus and calculate their probabilities. Then all the extracted tri-grams along with their calculated probabilities are stored in a file. During testing, system detects unrecognized NEs in the testing corpus and categorize them using the tri-gram probabilities calculated during training time. The proposed system consists of two modules namely Knowledge acquisition and NE Recognition. Knowledge acquisition module extracts the tri-grams surrounding NEs in the training corpus and NE Recognition module performs the categorization of Named Entities in the testing corpus.
Improving a Lightweight Stemmer for Gujarati Languageijistjournal
The origin of route of text mining is the process of stemming. It is usually used in several types of applications such as Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM) including Text Categorization (TC), Text Summarization (TS). Establish a stemmer effective for the language of Gujarati has been always a search domain hot since the Gujarati has a very different structure and difficult that the other language due to the rich morphology.
Similar to A survey of named entity recognition in assamese and other indian languages (20)
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
"Impact of front-end architecture on development cost", Viktor Turskyi
A survey of named entity recognition in assamese and other indian languages
1. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
10.5121/ijnlc.2014.3310 105
A SURVEY OF NAMED ENTITY
RECOGNITION IN ASSAMESE AND OTHER
INDIAN LANGUAGES
Gitimoni Talukdar1
, Pranjal Protim Borah2
, Arup Baruah3
1,2,3
Department of Computer Science and Engineering, Assam Don Bosco University,
Guwahati, India.
Abstract
Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
Keywords
Named entity recognition; Named entity; Annotated corpora; Gazetteer list; Heuristics.
1. INTRODUCTION
Named Entity is a text element indicating the name of a person, organization and location. It was
felt in the Message Understanding Conferences (MUC) of 1990s that if certain classes of
information are previously extracted from a document then the information extraction task
becomes really easy. In the later stage of the conference Named Entity Recognition systems were
asked to classify the names, time, date and numerical information .Thus Named Entity
Recognition is a task of two stages-first to identify the proper nouns and then to classify the
proper names into categories such as person name, organization names (e.g., government
organizations, companies), location names (e.g., countries and cities) and miscellaneous names
(e.g., percentage, date, number, time, monetary expressions).
In the MUC conference some conventions were followed for the named entities which include
[8]:
• NUMEX for numerical entities
• TIMEX for temporal entities
• ENAMEX for names
Assamese is an Indo European language in North Eastern India spoken by about 30 million
people and a national language of India. It is the official language of the state of Assam .The first
2. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
106
work on Named Entity Recognition for Assamese was rule based Named Entity Recognition
described in [5].
The paper has been further divided into six sections. In section 2 the approaches available for
named entity recognition has been highlighted with elaboration of the features that are commonly
used in named entity recognition in section 3. In section 4 a survey of the research work done in
Named Entity Recognition of Indian languages has been put forwarded followed by discussion of
challenges faced in Assamese Named Entity Recognition in section 5. Section 6 emphasizes on
performance metrics that are used to evaluate an Named Entity Recognition system and section 7
concludes our paper.
2. DIFFERENT APPROACHES
Named Entity Recognition approaches can broadly be classified into three classes namely rule-
based or handcrafted, machine learning or automated and hybrid model. A small overview of the
approaches is outlined below:
2.1. Rule-based Named Entity Recognition
Studies made in Named Entity Recognition initially were mainly based on handcrafted rules. This
can be seen from the fact that five systems were based on handcrafted rules out of eight in the
MUC-7 competition and sixteen presented in CONLL-2003 [8]. Human made rules forms the
main background of rule based Named Entity Recognition. Rule based approach can be further
classified as:
2.1.1. Linguistic Approach
In this approach expertise is required in the grammar and language to ascertain the heuristics for
classifying the named entities. Performance of this approach is based on coming up with good
rules and good heuristics.
2.1.2. List Look up Approach
For this a list called the gazetteer list is needed. This list is to be updated from time to time and
Named Entity Recognition works only for the gazetteer list.
2.2. Automated or Machine learning approach
Machine learning approaches are advantageous over rule based approaches as they can be easily
trained and can be used in various domains. All these approaches are statistical in nature. Some
machine learning approaches are Conditional Random Field (CRF), Hidden Markov Model
(HMM), Decision Trees (DT), Maximum Entropy Markov Model (MEMM) and Support Vector
Machine (SVM).
2.2.1. Conditional Random Field (CRF)
CRFs are used for segmenting and labelling data in sequential mode and is a discriminative
probabilistic model. It requires huge random and non-independent features. In CRF there is a state
sequence S and observed sequence O where observed sequence is the sequence of words in a text
document or sentence and state sequence is the respective label sequence. If the observed
sequence O = (o1,o2,......ok) is provided then the conditional probability of the state sequence S =
(s1,s2,.....sK) is given as
3. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
107
pሺs|oሻ ൌ
1
Z
exp ා Υp pሺSk െ 1, Sk, o,kሻ
୮
୩ୀଵ
Where Z represents the normalization factor.
Z ൌ exp ා Υp pሺSk െ 1, Sk, o,kሻ
୮
୩ୀଵ
Where pሺSk െ 1, Sk, o, kሻ is a feature function and Υp is learned during training phase.
2.2.2. Hidden Markov Model (HMM)
This model is based on maximizing the joint probability of the sequence of words and tags. If the
word sequence is Word= (word1 featureset1). . . . . . . . . . (wordn featuresetn) where wordi indicates
a word and featureseti is the associated single token feature set of wordi then the main aim is to
calculate the tag sequence T = t1,t2,.....tn for which the conditional probability of tag sequence
given the word sequence is maximized.
2.2.3. Decision Trees (DT)
Decision Tree is a tree structure used to make decisions at the nodes and obtain some result at the
leaf nodes .A path in the tree represents a sequence of decisions leading to the classification at the
leaf node. Decision trees are attractive because the rules can be easily understood from the tree .It
is a popular tool for prediction and classification.
2.2.4. Maximum Entropy Markov Model (MEMM)
MEMM is based on the principle that the model which considers all known facts is the one that
maximizes entropy. It can handle long term dependency and words having multiple features
which was a drawback for HMM. It suffers from Label-Bias problem. It is biased towards state
with lower outgoing transition because the probability transition coming out from any given state
must equal to one.
2.2.5. Support Vector Machine (SVM)
In a classification task using SVM the task usually involves training and testing data which
consist of some data instances. The goal is to predict the class of the data instances. It is one of
the famous binary classifier giving best results for fewer data sets and can be applied to multi-
class problems by extending the algorithm. The SVM classifier used in the training set for making
the classifier model and classify the testing data based on this model with the use of features.
2.3. Hybrid model approach
Machine learning approaches and rule based approaches are combined for more accuracy to
Named Entity Recognition. Various hybrid approaches can be CRF and rule based approach,
HMM and rule based approach, MEMM and rule based approach and SVM and rule based
approach.
4. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
108
3. FEATURES IN NER
Features commonly used for Named Entity Recognition are:
3.1. Surrounding words
Various combinations of previous to next words of a word which are the surrounding words can
be treated as a feature.
3.2. Context word feature
Specific list for class can be created for the words occurring quite often at previous and next
positions of a word belonging to a particular class. This feature is set to 1 if the surrounding
words are found in the class context list.
3.3. Digit features
Binary valued digit features can be helpful in detecting the miscellaneous named entities such as:
• ContDigitPeriod: Word contains digits and periods.
• ContDigitComma: Word contains of digits and commas.
• ContDigitSlash: Word contains digits and slash.
• ContDigitHyphen: Word contains digits and hyphen.
• ContDigitPercentage: Word contains digits and percentage.
• ContDigitSpecial: Word contains digits and special symbols.
3.4. Infrequent word
Infrequent or rare word can be found by calculating the frequencies of the words in the corpus of
the training phase and by selecting a cut off frequency to allow only those words as rare that
occurs less than the cut off frequency. This feature is important as it is found that infrequent
words are most probably named entities.
3.5. Word suffix
The word suffix feature can be defined in two ways. The first way is to use fixed length suffix
information of the surrounding words and the other way is to use suffixes of variable length and
matching with the predefined lists of valid suffixes of named entities.
3.6. Word prefix
The word prefix feature can be defined in two ways. The first way is to use fixed length prefix
information of the surrounding words and the other way is to use prefixes of variable length and
matching with the predefined lists of valid prefixes of named entities.
3.7. Part of speech information
The part of speech information of the surrounding words can be used for recognizing named
entities.
5. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
109
3.8. Numerical word
This feature can be set to 1 if a token signifies a number.
3.9. Word length
This feature can be used to check if the length of a word is less than three or not because it is
found that very short words are not named entities.
4. OBSERVATIONS AND DISCUSSIONS
In this section we provide a survey of the research done in Indian languages. A research work was
mentioned by Praneeth M Shishtla, 2008 , “A Character n-gram Based Approach for Improved
Recall in Indian Language Named Entity Recognition” [2] used Telugu corpus containing 4709
named entities from 45,714 tokens with Hindi corpus containing 3140 named entities from 45,380
tokens and English corpus containing 4287 named entities from 45,870 tokens. They used CRF
technique with Character based n-gram technique on two languages Hindi and Telugu. During
training and testing they used morphological analyzers, POS taggers and chunkers along with
feature set having nine features. It was found that Gram n=4 gave a F-measure 45.18% for 35k
words, 42.36% for 30k words, 36.26% for 20k words and 40.96% for 10k words for Hindi. Gram
n=3 gave F-measure of 48.93% for 35k words, 44.48% for 30k words, 35.38% for 20k words and
24.2% for 10k words for Telugu and Gram n=2 gave F-measure up to 68.46% for 35k words,
67.49% for 30k words, 65.59% for 20k words and 52.92% for 10k words for English.
Another work that was reported was “Bengali Named Entity Recognition using Support Vector
Machine” mentioned by Asif Ekbal and Sivaji Bandyopadhyay (2008) [3] used a training corpus
of 1,30,000 tokens with sixteen named entity tags on Bengali with support vector machine
approach provided F-measure of 91.8% on test set of 1,50,000 words.
“A hybrid Approach for Named Entity Recognition in Indian Languages” mentioned by Sujan
Kumar Saha et al.(2008) [4] allowed twelve classes of named entity recognition for Bengali,
Hindi, Telugu , Urdu and Oriya. A combination of maximum entropy model, gazetteers list and
some language dependent rules were used. The system reported F-measure of 65.96% for
Bengali, 65.13% for Hindi, 18.75% for Telugu, 35.47% for Urdu and 44.65% for Oriya.
“Domain Focused Named Entity Recognition for Tamil Using Conditional Random Fields”
mentioned by Vijaya Krishna R and Sobha L (2008) [7] used CRF approach for Tamil focussed
on tourism domain. A corpus of about 94,000 words in tourism domain was used and 106 tag sets
along with five feature template. A part of the corpus was used for training and another part was
used for testing. The system reported a F-measure of 80.44% respectively.
“Language Independent Named Entity Recognition in Indian languages” mentioned by Asif
Ekbal et al., 2008 [1] used CRF approach for named entity recognition. The system applied
language dependent features to Hindi and Bengali only and language independent features on
Bengali, Hindi, Telugu, Urdu and Oriya along with contextual information of words. The system
was trained with 502,974 tokens for Hindi, 93,173 tokens for Oriya, 64,026 tokens for Telugu,
35,447 tokens for Urdu and 122,467 tokens for Bengali. The system was tested with 38,708
tokens for Hindi, 24,640 tokens for Oriya, 6,356 tokens for Telugu, 3,782 tokens for Urdu and
30,505 tokens for Bengali. An F-measure of 53.46% was obtained for Bengali.
Padmaja Sharma, Utpal Sharma and Jugal Kalita 2010, “The first Steps towards Assamese
Named Entity Recognition” [5] developed a handcrafted rule based Named Entity Recognition for
6. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
110
Assamese. A corpus of about 50000 words was manually tagged from Assamese online Protidin
articles. The system found 500 person names and 250 location names. They analyzed the tagged
corpus to enumerate some rules for automatic Named Entity tagging.
“Suffix Stripping Based Named Entity Recognition in Assamese for Location Names” mentioned
by Padmaja Sharma, Utpal Sharma and Jugal Kalita (2010) [6] used an Assamese Pratidin Corpus
containing 300,000 wordforms. A location named entity was produced by generating the root
word by removing suffixes. In their approach the Assamese word occurring in the text is the input
and a list of suffixes is used which commonly combines with location named entities. The
stemmer removes the suffixes from the input word by searching for the suffixes in the key suffix
file and if a match is found the suffix along with the trailing characters is removed to produce the
output which is the location named entity. The approach was simple and obtained an F-measure
of 88%.
Table 1. F-measure achieved in Hindi for different statistical approach.
Approach Used F-measure (%)
MEMM [4] 65.13
Character based n-gram technique [2] 45.18
Language dependent features [1] 33.12
Table 2. F-measure achieved in Bengali for different statistical approach.
Approach Used F-measure (%)
MEMM [4] 65.96
Language dependent features [1] 59.39
SVM [3] 91.8
Table 3. F-measure achieved in Telegu for different statistical approach.
Approach Used F-measure (%)
MEMM [4] 18.75
CRF [13] 92
Language independent features [1] 47.49
Character based n-gram technique [2] 48.93
Table 4. F-measure achieved in Oriya for different statistical approach.
Approach Used F-measure (%)
MEMM [4] 44.65
Language independent features [1] 28.71
Table 5. F-measure achieved in Assamese for Suffix stripping based approach.
Approach Used F-measure (%)
Suffix stripping based approach [6] 88
7. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
111
Table 6. F-measure achieved in Urdu for different statistical approach.
Approach Used F-measure (%)
MEMM [4] 35.47
Language independent features [1] 35.52
5. KEY ISSUES IN ASSAMESE NAMED ENTITY RECOGNITION
5.1. Ambiguity in Assamese
Ambiguity occurs between common noun and proper noun as most of the names are taken from
dictionary. For example, я (Jon) indicates moon which is a common noun but may also indicate
the name of a person which is a proper noun.
5.2. Agglutinative nature
Assamese language suffers from agglutination and complex words are created by adding
additional features to change the meaning of the word. For example, a (Assam) is the name of a
place which is a location named entity but a (AssamIYA) is produced by adding suffix
(IYA) to a (Assam) which signifies people residing in Assam which is not a location named
entity [5].
5.3. Lack of capitalization
In Assamese there is no capitalization that can help to recognize the proper nouns as found in
English.
5.4. Nested Entities
When two or more proper nouns are present then it becomes difficult to assign the proper named
entity class. For example, In (Gauhati bishabidyaly) (Gauhati) is a location
named entity and (bishabidyalay) refers to organization thus creating problem in
assigning the proper class.
5.5. Spelling Variation
Changes in the spelling of proper names are another problem in Assamese Named Entity
Recognition. For example, In n (Shree Shreesanth) there is a confusion whether (Shree) in
n (Shreesanth) is a Pre-nominal word or person named entity.
6. PERFORMANCE METRICS
The level of accuracy of a system to recognize Named Entity Recognition can be done by the
following metrics:
Precision (P): Precision is the ratio of relevant results returned to total number of results returned.
It can be represented as: P = W/Y
Recall (R): Recall is the ratio of relevant results returned to all relevant results. It can be
represented as: R= W/T
8. International Journal on Natural Language Computing (IJNLC) Vol. 3, No.3, June 2014
112
F1-measure: 2*(P*R) / (P+R)
Where,
W= Number of relevant Named Entities returned.
Y= Total Named Entities returned.
T= Total Named Entities present.
7. CONCLUSION
Indian languages suffer greatly from lack of available annotated corpora, agglutinative nature,
different writing methodologies, difficult morphology and no concept of capitalization and as
such research of Named Entity Recognition in Indian languages are not much as compared to the
European languages. We found that rule based approaches with gazetteer list along with some
language independent rules combined together with statistical approach may give satisfactory
results for Named Entity Recognition in Indian languages because of insufficient data available
for training. Our conclusion is that in a situation where sufficient training data is not available a
hybrid model where combination of rule based, statistical and language independent rules are
used will be a better approach to perform Named Entity Recognition in Indian languages.
REFERENCES
Authors :
Gitimoni Talukdar.
B.E. (CSE), Research Scholar.
Pranjal Protim Borah.
B.E. (CSE), Research Scholar.
Arup Baruah.
B.E (CSE), M.Tech (IT)
Assistant Professor, Department of CSE,
Assam Don Bosco University, Guwahati, India.