Information Extraction (IE) is a sub-discipline of Artificial Intelligence. IE identifies information in
unstructured sources that adheres to predefined semantics, e.g. people, locations, etc. Recognition
of named entities (NEs) from computer-readable natural language text is a significant task in IE and natural
language processing (NLP). Named entity (NE) extraction is an important step in processing unstructured
content: unstructured data is computationally opaque, and computers require computationally transparent
data for processing. IE adds meaning to raw data so that it can be easily processed by computers.
Various approaches are applied for the extraction of entities from text. This paper elaborates the
need for NE recognition for Marathi and discusses the issues and challenges involved in NE recognition tasks
for the Marathi language. It also explores methods and techniques useful for the creation of
learning resources and lexicons that are important for extracting NEs from unstructured natural
language text.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Named Entity Recognition System for Hindi Language: A Hybrid Approach - Waqas Tariq
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text-to-speech synthesis, natural language understanding, etc. It seeks to classify words which represent names in text into predefined categories like location, person name, organization, date, time, etc. In this paper we have used a combination of machine learning and rule-based approaches to classify named entities. The paper introduces a hybrid approach for NER. We have experimented with statistical approaches like Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt) and a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, Rule base, Voting, Hybrid approach
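The voting step of such a hybrid system can be sketched as a simple per-token majority vote over the CRF, MaxEnt and rule-based outputs. This is a sketch of the general technique, not the paper's exact scheme; the label sequences are invented:

```python
from collections import Counter

def vote(predictions):
    """Combine per-token NE labels from several models by majority vote.

    predictions: list of label sequences, one per model, all the same length.
    Ties are broken in favour of the earliest-listed model's label.
    """
    n_tokens = len(predictions[0])
    combined = []
    for i in range(n_tokens):
        labels = [seq[i] for seq in predictions]
        counts = Counter(labels)
        top = counts.most_common(1)[0][1]
        # break ties by model order: first label that reaches the top count
        winner = next(l for l in labels if counts[l] == top)
        combined.append(winner)
    return combined

crf    = ["B-PER", "O", "B-LOC"]
maxent = ["B-PER", "O", "O"]
rules  = ["B-PER", "B-ORG", "B-LOC"]
print(vote([crf, maxent, rules]))  # ['B-PER', 'O', 'B-LOC']
```

A weighted variant (e.g. trusting the CRF more on ties) is a natural refinement.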
Myanmar named entity corpus and its use in syllable-based neural named entity... - IJECEIAES
Myanmar is a low-resource language, and this is one of the main reasons why Myanmar natural language processing has lagged behind that of other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for Myanmar was developed and proposed to support the evaluation of named entity extraction. At present, our named entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of 10-fold cross validation revealed that syllable-based neural sequence models without additional feature engineering give better results than a baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to textual processing for Myanmar as well as to promote future research on this understudied language.
SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS - ijnlc
The evolution of information technology has led to the collection of large amounts of data, the volume of
which has increased to the extent that the data produced in the last two years is greater than all the data ever
recorded in human history before. This has necessitated the use of machines to understand, interpret and apply data
without manual involvement. Many of these texts are available in transliterated, code-mixed form, which is
very difficult to analyze because of its complexity. Work in this area is progressing at a
great pace, and this work hopes to push it further. The designed system automatically
classifies transliterated (Romanized) Hindi and Marathi text documents using
supervised learning methods (KNN, Naïve Bayes and Support Vector Machine (SVM)) and ontology-based
classification; the results are compared in order to decide which methodology is better suited to
handling these documents. As we will see, the plain machine learning algorithms perform as well as,
and in many cases much better than, the more analytical approach.
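One of the supervised methods mentioned, Naïve Bayes, can be sketched in a few lines. The romanized Hindi/Marathi tokens below are invented stand-ins, not data from the paper:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Multinomial NB with add-one smoothing."""
    label_counts = Counter(lbl for _, lbl in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for toks, lbl in docs:
        word_counts[lbl].update(toks)
        vocab.update(toks)
    return label_counts, word_counts, vocab, len(docs)

def classify(tokens, model):
    label_counts, word_counts, vocab, n_docs = model
    best, best_lp = None, float("-inf")
    for lbl in label_counts:
        lp = math.log(label_counts[lbl] / n_docs)      # class prior
        total = sum(word_counts[lbl].values())
        for t in tokens:                                # smoothed likelihoods
            lp += math.log((word_counts[lbl][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

docs = [
    (["accha", "movie", "mast"], "pos"),    # invented transliterated tokens
    (["bekar", "movie", "kharab"], "neg"),
]
model = train_nb(docs)
print(classify(["mast", "movie"], model))  # 'pos'
```

KNN and SVM would consume the same token counts as feature vectors; the ontology-based alternative instead matches tokens against a hand-built concept hierarchy.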
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGES - csandit
ABSTRACT
Natural language processing is an interdisciplinary branch of linguistics and computer science, studied under Artificial Intelligence (AI), that gave birth to an allied area called
'Computational Linguistics', which focuses on processing natural languages on computational devices. A natural language consists of a large number of sentences, which are linguistic units involving one or more words linked together in accordance with a set of predefined rules called a grammar. Grammar checking is the task of validating sentences syntactically and is a prominent tool within language engineering. Our review draws on the recent development of various grammar checkers to look at the past, present and future in a new light. It covers grammar checkers for many languages with the aim of surveying their approaches and methodologies for developing new tools and systems as a whole. The survey concludes with a discussion of the features included in existing grammar checkers for foreign languages as well as a few Indian languages.
A New Approach to Parts of Speech Tagging in Malayalam - ijcsit
Parts-of-speech tagging is the process of labeling each word in a sentence. A tag indicates the word's
usage in the sentence. Usually these tags denote a syntactic classification like noun or verb, and sometimes
include additional information such as case markers (number, gender, etc.) and tense markers. A large number
of current language processing systems use a parts-of-speech tagger for pre-processing.
There are two main approaches to parts-of-speech tagging: the rule-based
approach and the stochastic approach. The rule-based approach uses predefined handwritten rules; it is the
oldest approach and uses a lexicon or dictionary for reference. The stochastic approach uses probabilistic and
statistical information to assign tags to words. It uses a large corpus, so its time and space
complexity are high, whereas the rule-based approach has lower complexity in both time and space. The stochastic
approach is the more widely used nowadays because of its accuracy.
Malayalam is a Dravidian language, inflectional, with suffixes attached to root word forms. The
currently used algorithms are efficient machine learning algorithms, but they are not built for
Malayalam, which affects the accuracy of Malayalam POS tagging.
The proposed approach uses dictionary entries along with adjacent tag information. The algorithm uses
multithreading, and tagging is done using the probability of occurrence of the sentence
structure along with the dictionary entry.
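A minimal sketch of the dictionary-plus-adjacent-tag idea follows. The romanized Malayalam words, tagset and probabilities are invented for illustration; the paper's actual algorithm is multithreaded and far richer:

```python
# lexicon: word -> candidate tags; bigram: (prev_tag, tag) -> probability
# (romanized Malayalam "avan odi vannu" ~ "he came running"; values invented)
LEXICON = {"avan": {"PRP"}, "odi": {"V", "N"}, "vannu": {"V"}}
BIGRAM = {("PRP", "V"): 0.6, ("PRP", "N"): 0.2, ("V", "V"): 0.5, ("V", "N"): 0.1}

def tag(sentence):
    """Pick each word's tag from its dictionary candidates, disambiguating
    with the probability of following the previous tag."""
    tags, prev = [], "<s>"
    for word in sentence:
        candidates = LEXICON.get(word, {"N"})   # unknown words default to noun
        best = max(candidates, key=lambda t: BIGRAM.get((prev, t), 0.01))
        tags.append(best)
        prev = best
    return tags

print(tag(["avan", "odi", "vannu"]))  # ['PRP', 'V', 'V']
```

The multithreaded version would partition the sentence list across worker threads, since each sentence is tagged independently.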
Author Credits - Maaz Nomani
A Proposition Bank is a collection of sentences which are hand-annotated with semantic label information. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with semantic labels.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems, etc.
In this paper, we present one such resource for a resource-poor Indian language, Urdu. The Proposition Bank of Urdu is built on the existing Treebank of Urdu (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency label information). A PropBank adds a layer of semantic information over this Treebank and hence can facilitate semantic parsing and other semantic-level operations on natural language sentences.
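The layering described above can be pictured as a small data structure: treebank tokens plus a separate semantic-role layer keyed by predicate. The field names and labels here (k1, aa.01, ARG0) are illustrative stand-ins, not the actual Urdu PropBank format:

```python
# A treebank token already carries POS / morphological / dependency info;
# the propbank layer adds predicate-argument (semantic role) labels on top.
treebank_sentence = [
    {"id": 1, "form": "Ali", "pos": "NNP", "head": 2, "deprel": "k1"},
    {"id": 2, "form": "aaya", "pos": "VM", "head": 0, "deprel": "root"},
]
# predicate token id -> frame (sense id) and argument roles by token id
propbank_layer = {2: {"predicate": "aa.01", "args": {1: "ARG0"}}}

def semantic_roles(sentence, layer):
    """Join the two layers into (predicate, role, surface form) triples."""
    roles = []
    for pred_id, frame in layer.items():
        for tok_id, role in frame["args"].items():
            form = next(t["form"] for t in sentence if t["id"] == tok_id)
            roles.append((frame["predicate"], role, form))
    return roles

print(semantic_roles(treebank_sentence, propbank_layer))
# [('aa.01', 'ARG0', 'Ali')]
```

Keeping the semantic layer separate means the treebank files stay untouched, which is how stacked annotation projects are commonly organized.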
Kannada Phonemes to Speech Dictionary: Statistical Approach - IJERA Editor
The input or output of a natural language processing system can be either written text or speech. To process written text we need lexical, syntactic and semantic knowledge of the language, discourse information and real-world knowledge; to process spoken language, we need everything required to process written text, along with the challenges of speech recognition and speech synthesis. This paper describes how the articulatory phonetics of Kannada is used to generate a phoneme-to-speech dictionary for Kannada; a statistical computational approach is used to map the elements taken from an input query or documents. The place of articulation of a consonant is the point of contact where an obstruction occurs in the vocal tract between an articulatory gesture, an active articulator (typically some part of the tongue) and a passive location (typically some part of the roof of the mouth). Along with the manner of articulation and the phonation, this gives the consonant its distinctive sound. Results are presented for the same.
Language Identification from a Tri-lingual Printed Document: A Simple Approach - IJERA Editor
In a multi-script, multilingual country like India, a document may contain text lines in more than one script/language. For such a multi-script environment, a multilingual Optical Character Recognition (OCR) system is needed to read multi-script documents. To make a multilingual OCR system successful, it is necessary to identify the different script regions of the document before feeding it to the OCRs for the individual languages. In this context, this paper addresses the requirements of a particular region, Andhra Pradesh, a state in India, where any document, including official ones, may contain text in three languages: Telugu, Hindi and English. The objective of this paper is therefore to develop a system that accurately identifies and separates Telugu, Hindi and English text lines in a printed multilingual document, and groups portions of the document in other languages into a separate category, OTHERS. The proposed method is based on a thorough understanding of the top and bottom profiles of printed text lines. Experimentation involved 900 text lines for learning and 900 text lines for testing; the performance turned out to be 95.67%.
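The top and bottom profiles such a method relies on are easy to compute from a binarized text-line image. This is a generic sketch on a toy 0/1 pixel grid, not the paper's implementation:

```python
def profiles(line_image):
    """line_image: 2-D list of 0/1 pixels for one text line.
    Returns, per column, the row index of the first and last ink pixel
    (None for blank columns) -- the 'top' and 'bottom' profiles."""
    rows, cols = len(line_image), len(line_image[0])
    top, bottom = [], []
    for c in range(cols):
        ink = [r for r in range(rows) if line_image[r][c]]
        top.append(ink[0] if ink else None)
        bottom.append(ink[-1] if ink else None)
    return top, bottom

img = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
]
print(profiles(img))  # ([1, 0, 2], [1, 2, 2])
```

Script classification would then compare statistics of these profiles (e.g. how flat the top profile is, reflecting the headline/shirorekha of Devanagari) against per-script expectations.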
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA - ijnlc
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased
enormously. It is therefore essential to extract events from it intelligently. Progress in
natural language processing (NLP) technologies for information extraction is used to locate and classify content in
news data according to predefined categories such as person name, place name, organization name, date,
time, etc. Named entity recognition (NER), a subtask of NLP, plays a vital role in
achieving human-level performance on specific documents such as newspapers by effectively identifying entities.
The purpose of this research is to introduce an NER system for Bengali news data that identifies mentions of specified
things in running text based on regular expressions and Bengali grammar. In doing so, I have designed and
evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden
Markov Model (HMM) based approach for developing an NER system from Bengali news data.
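The regular-expression side of such a system can be sketched as follows. The ASCII date/time patterns are stand-ins for the Bengali-script digit and month-name patterns a real system would use:

```python
import re

# Illustrative patterns only: real Bengali news text would use Bengali
# digits and month names; these ASCII versions show the technique.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
TIME = re.compile(r"\b\d{1,2}:\d{2}\b")

def extract(text):
    """Pull pattern-matchable entity classes out of running text."""
    return {"DATE": DATE.findall(text), "TIME": TIME.findall(text)}

print(extract("Meeting on 12/05/2014 at 10:30"))
# {'DATE': ['12/05/2014'], 'TIME': ['10:30']}
```

Open classes such as person and organization names, which regexes cannot enumerate, are where the HMM tagger takes over.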
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... - Syeful Islam
Hundreds of millions of people of almost all levels of education and attitudes from different countries communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. In Bangla there is no concept of small or capital letters, and a huge number of different naming entities exist. Thus it is difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible while storing a minimal number of words in the dictionary.
Development and evaluation of a web based question answering system for arabi... - ijnlc
Question Answering (QA) systems are gaining great importance due to the increasing amount of web
content and the high demand for digital information that regular information retrieval techniques cannot
satisfy. A question answering system enables users to have a natural language dialog with the machine,
which is required for virtually all emerging online service systems on the Internet. The need for such
systems is higher in the context of the Arabic language. This is because of the scarcity of Arabic QA
systems, which can be attributed to the great challenges they present to the research community, including
the particularities of Arabic, such as short vowels, absence of capital letters, complex morphology, etc. In
this paper, we report the design and implementation of an Arabic web-based question answering
system, which we called "JAWEB", the Arabic word for the verb "answer". Unlike other Arabic question-answering
systems, JAWEB is a web-based application, so it can be accessed at any time and from
anywhere. Evaluating JAWEB showed that it gives the correct answer with 100% recall and 80% precision
on average. When compared to ask.com, the well-established web-based QA system, JAWEB provided 15-20%
higher recall. These promising results give clear evidence that JAWEB has great potential as a QA
platform and is much needed by Arabic-speaking Internet users across the world.
Hidden Markov Model Based Part of Speech Tagger for Sinhala Language - ijnlc
In this paper we present fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Sinhala. In any natural language processing task, part of speech is a vital topic, involving analysis of the construction, behaviour and dynamics of the language; this knowledge can be utilized in computational linguistics analysis and automation applications. Since Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is essential for further analysis of the language. Our research is based on a statistical approach, in which the tagging process is done by computing the tag-sequence probability and the word-likelihood probability from the given corpus, where the linguistic knowledge is automatically extracted from the annotated corpus. The current tagger reaches more than 90% accuracy for known words.
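The two probabilities named above (tag sequence and word likelihood) combine in the standard Viterbi decoder. This toy sketch uses an invented two-tag set, romanized Sinhala-like words and made-up probabilities, purely to show the computation:

```python
import math

def viterbi(words, tags, pi, trans, emit):
    """HMM tagging: maximise the product of tag-sequence and word-likelihood
    probabilities. pi/trans/emit are log-probability dicts; unseen events
    get a small floor value."""
    floor = math.log(1e-6)
    V = [{t: pi.get(t, floor) + emit.get((t, words[0]), floor) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev, score = max(
                ((p, V[-1][p] + trans.get((p, t), floor)) for p in tags),
                key=lambda x: x[1],
            )
            col[t] = score + emit.get((t, w), floor)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for ptr in reversed(back):          # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))

lg = math.log
tags = ["N", "V"]
pi = {"N": lg(0.7), "V": lg(0.3)}
trans = {("N", "V"): lg(0.6), ("N", "N"): lg(0.4),
         ("V", "N"): lg(0.7), ("V", "V"): lg(0.3)}
emit = {("N", "mama"): lg(0.5), ("V", "yanawa"): lg(0.5)}
print(viterbi(["mama", "yanawa"], tags, pi, trans, emit))  # ['N', 'V']
```

In a real tagger the pi/trans/emit tables are estimated by counting over the annotated corpus, which is what "linguistic knowledge automatically extracted" amounts to.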
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA... - ijnlc
The comprehension of entire handwritten documents is a challenging problem that comprises several
difficult tasks. Given a handwritten document, its layout first needs to be analysed to isolate the different
content types. These content types can then be routed to specific subsystems,
including writer-style, image or table recognizers. Research in automatic writer identification has
principally centred on the statistical approach. This has led to the selection and extraction
of statistical features such as run-length distributions, slant distribution, entropy and edge-hinge
distribution. The edge-hinge distribution feature outperforms all other statistical features. The edge-hinge
distribution is a feature that characterizes the changes in direction of a writing stroke in handwritten
text. It is extracted by means of a window that is slid over an edge-detected
offline scanned image. Whenever the central pixel of the window is on, the two edge
fragments (i.e. connected sequences of pixels) emerging from this central pixel are considered. Their directions
are measured and stored as pairs. A joint probability distribution is obtained from an extensive recognition
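Assuming the feature described is the edge-hinge distribution familiar from writer-identification work, a much-simplified sketch might look like this. A real implementation works on edge-detected images and traces connected edge fragments over a larger window, not just immediate neighbours:

```python
from collections import Counter

# 8 chain-code directions around a pixel: (row offset, column offset)
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def edge_hinge(img):
    """Joint histogram of the direction pair of the two edge fragments
    leaving each ink pixel (a simplified edge-hinge feature).
    img: 2-D list of 0/1 pixels."""
    hist = Counter()
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            nbrs = [d for d, (dr, dc) in enumerate(DIRS)
                    if 0 <= r + dr < rows and 0 <= c + dc < cols
                    and img[r + dr][c + dc]]
            if len(nbrs) >= 2:      # a hinge needs two outgoing fragments
                hist[(nbrs[0], nbrs[1])] += 1
    return hist

stroke = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(edge_hinge(stroke))  # Counter({(3, 7): 1})
```

Normalized over a whole page, this histogram becomes a writer signature that can be compared across documents with a distance measure such as chi-square.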
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION - kevig
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing a mix of English and Bangla words and phrases have become increasingly common. Existing transliteration tools perform poorly on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach combining dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
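The dictionary-first, rule-fallback idea can be sketched as below. The mappings and the two-stage interface are assumptions for illustration, not THT's actual tables or stages:

```python
# Dictionary-first, rule-fallback transliteration sketch (assumed interface;
# the real THT pipeline has three stages and far richer rule tables).
DICTIONARY = {"bhalo": "ভালো", "ami": "আমি"}
RULES = [("kh", "খ"), ("aa", "আ"), ("k", "ক"), ("a", "া")]  # longest first

def transliterate(word):
    if word in DICTIONARY:              # stage 1: exact dictionary hit
        return DICTIONARY[word]
    out, i = "", 0
    while i < len(word):                # stage 2: greedy rule matching
        for src, dst in RULES:
            if word.startswith(src, i):
                out += dst
                i += len(src)
                break
        else:
            out += word[i]              # pass through unmapped characters
            i += 1
    return out

print(transliterate("ami"))   # dictionary hit
print(transliterate("kha"))   # rule path: খ followed by া
```

Ordering the rules longest-source-first is what lets digraphs like "kh" win over single letters, the usual trick in rule-based transliteration.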
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT... - kevig
This study proposes a method to develop neural models of the morphological analyzer for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because the Japanese language does not have word delimiters between words. Hiragana is a type of Japanese phonogramic characters, which is used for texts for children or people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information for dividing. For morphological analysis of Hiragana sentences, we demonstrated the effectiveness of fine-tuning using a model based on ordinary Japanese text and examined the influence of training data on texts of various genres.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC... - kevig
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With the one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we reported two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LM) are integrated into speech decoding to eliminate possible phone
mismatches. The supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English languages that are often
mixed. Our system is evaluated by measuring both the ASR performance and the LID performance
separately. The systems have obtained a promising ASR accuracy with data-driven phone merging
approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual
speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
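The phoneme-bigram language-model scoring used in such a LID back-end can be sketched as follows, with toy phone strings and add-one smoothing. The real system scores recognized phone sequences and layers an SVM on top of the likelihoods:

```python
import math
from collections import Counter

def train_bigram_lm(phone_sequences):
    """Phone-bigram LM with add-one smoothing over the observed phone set.
    Returns a scorer: phone sequence -> log-likelihood."""
    bigrams, unigrams, phones = Counter(), Counter(), set()
    for seq in phone_sequences:
        seq = ["<s>"] + seq
        phones.update(seq)
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq, seq[1:]))
    V = len(phones)
    return lambda seq: sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(["<s>"] + seq, seq)
    )

# Toy training data: invented phone strings standing in for each language.
english = train_bigram_lm([["h", "e", "l", "o"], ["h", "e", "l", "p"]])
sepedi  = train_bigram_lm([["d", "u", "m", "e", "l", "a"], ["t", "h", "o"]])

def identify(seq):
    """Label a segment with the language whose LM scores it higher."""
    return "english" if english(seq) > sepedi(seq) else "sepedi"

print(identify(["h", "e", "l"]))  # 'english'
```

Per-segment log-likelihoods like these are exactly the kind of scores an SVM back-end can consume as features for the final language decision.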
Transliteration by Orthography or Phonology for Hindi and Marathi to English ... - ijnlc
e-Governance and web-based online commercial multilingual applications have given utmost importance to
the tasks of translation and transliteration. The named entities and technical terms that occur in the source
language are called out-of-vocabulary words, as they are not available in the multilingual
corpus or dictionary used to support the translation process. These named entities and technical terms need
to be transliterated from the source language to the target language without losing their phonetic properties. A
fundamental problem in India is that there is no set of rules for writing the English spellings of
Indian-language words according to the linguistics. People write different spellings for the same name in
different places. This fact certainly affects the Top-1 accuracy of transliteration and, in turn, the
translation process. The major issue we noticed is the transliteration of named entities consisting of three
syllables or three phonetic units in Hindi and Marathi, where people use a mixed approach to
spelling, either orthographical or phonological. In this paper the authors
provide, through experimentation, their opinion about the appropriateness of either approach.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ... - kevig
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age and each anime character, and show that they are linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed that with the conventional method.
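TF/IDF weighting over subword units can be sketched as below. The romanized "utterance endings" are invented stand-ins for the Japanese subword units the study analyzes:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: {name: list of subword units}. Returns per-doc TF-IDF weights."""
    n = len(docs)
    df = Counter()                      # document frequency per unit
    for units in docs.values():
        df.update(set(units))
    weights = {}
    for name, units in docs.items():
        tf = Counter(units)
        weights[name] = {
            u: (tf[u] / len(units)) * math.log(n / df[u]) for u in tf
        }
    return weights

# Toy 'character lines' split into subword units (romanized stand-ins for
# distinctive Japanese utterance endings):
docs = {
    "ojousama": ["watashi", "desu", "wa", "desu", "wa"],
    "samurai":  ["sessha", "de", "gozaru", "de", "gozaru"],
}
w = tfidf(docs)
print(max(w["ojousama"], key=w["ojousama"].get))  # 'desu'
```

Units frequent in one character's lines but absent from the others get the highest weights, which is exactly what makes them candidate "speech patterns".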
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB... - ijnlc
In this paper we present our work in studying the sublanguage of Arabic SMS-based classified ads.
The study is presented from the developer's point of view, using the corpus collected from an
operational system, CATS. We compare the SMS-based and the Web-based messages, and discuss
some quantitative properties of the studied text.
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts - ijnlc
In this paper a system for recognizing Arabic ancient manuscripts is presented. The system is
divided into four parts. The first part is image pre-processing, where the text in the ancient
manuscript is recognized as a collection of Arabic characters through three phases of processing. The
second part is Arabic text analysis, which consists of a lexical analyzer, a syntax analyzer and a semantic
analyzer; the output of this subsystem is an XML file that represents the manuscript text.
The third part is intermediate text generation, in which an intermediate representation of the Arabic
text is generated from the XML file. The fourth part is Arabic text generation, which
converts the generated text to Modern Standard Arabic (MSA); this part has four phases: text
organizer, pre-optimizer, semantics generator and post-optimizer.
In this paper I first compare Single-Label Text Categorization with Multi-Label Text Categorization in detail, and then compare Document-Pivoted Categorization with Category-Pivoted Categorization. For this purpose I give the general definition of Text Categorization with its mathematical notation, for the sake of frugality and cost-effectiveness. With the help of mathematical notation and set theory, I convert the general definitions of Single-Label and Multi-Label Text Categorization into their respective mathematical representations, and discuss Binary Text Categorization as a special case of Single-Label Text Categorization. After comparing Single-Label with Multi-Label Text Categorization, I find that Single-Label (or Binary) Text Categorization is more general than Multi-Label Text Categorization. I then discuss an algorithm for transforming Multi-Label Classification into Binary Classification and explain the conditions under which the transformation holds. In the second step I compare Document-Pivoted Categorization with Category-Pivoted Categorization in detail, finding that Category-Pivoted Categorization is more involved and complex than Document-Pivoted Categorization; it becomes more complicated still when a new category is added to the predefined set and recurrent classification of documents takes place. Finally I compare Hard Categorization with Ranking Categorization. Hard Categorization incorporates 'hard decisions' about the relevance or belonging of a document to a category: such a decision is either completely true or completely false. Ranking Categorization, in contrast, ranks the belonging of a document to a category
according to the estimated appropriateness of the document; the final ranked list produced by Ranking Categorization is then used by a human expert for the final categorization decision.
Author Credits - Maaz Nomani
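The multi-label-to-binary transformation discussed in the abstract above can be sketched as follows. This is the generic reduction (one yes/no instance per document-category pair), not necessarily the paper's exact algorithm; all names and data are illustrative.

```python
# Sketch: reducing a multi-label assignment to binary decisions.
# Each (document, category) pair becomes one yes/no instance.

def to_binary_instances(documents, categories, labels):
    """labels maps each document to the set of categories it belongs to."""
    instances = []
    for d in documents:
        for c in categories:
            # True iff document d belongs to category c
            instances.append((d, c, c in labels[d]))
    return instances

docs = ["d1", "d2"]
cats = ["sports", "politics"]
labels = {"d1": {"sports"}, "d2": {"sports", "politics"}}
print(to_binary_instances(docs, cats, labels))
```

A binary classifier trained on such instances then answers, for any new (document, category) pair, whether the document belongs to that category.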
A Proposition Bank is a collection of sentences hand-annotated with semantic label information. Currently, around 10,000 sentences containing 0.2 million words have been hand-annotated with semantic labels.
This is a natural language resource of very rich linguistic information which can be used in a variety of NLP applications such as semantic parsing, syntactic parsing, sentiment analysis, dialogue systems etc.
In this paper, we present one such resource for a resource-poor Indian language, Urdu. The Proposition Bank of Urdu is built on the existing Urdu Treebank (a Treebank is a corpus of sentences annotated with POS, morphological, head, TAM and dependency label information). A PropBank adds a layer of semantic information over this Treebank and hence can facilitate semantic parsing and other semantic-level operations on natural language sentences.
Kannada Phonemes to Speech Dictionary: Statistical Approach (IJERA Editor)
The input or output of a natural language processing system can be either written text or speech. To process written text we need lexical, syntactic and semantic knowledge of the language, along with discourse information and real-world knowledge; to process spoken language we need everything required for written text, plus the additional challenges of speech recognition and speech synthesis. This paper describes how the articulatory phonetics of Kannada is used to generate a phoneme-to-speech dictionary for Kannada; a statistical computational approach is used to map the elements taken from the input query or documents. Articulatory phonetics concerns the place of articulation of a consonant: the point of contact where an obstruction occurs in the vocal tract between an articulatory gesture, an active articulator (typically some part of the tongue) and a passive location (typically some part of the roof of the mouth). Together with the manner of articulation and the phonation, this gives the consonant its distinctive sound. Results are presented for the same.
Language Identification from a Tri-lingual Printed Document: A Simple Approach (IJERA Editor)
In a multi-script, multilingual country like India, a document may contain text lines in more than one script/language. For such a multi-script environment, a multilingual Optical Character Recognition (OCR) system is needed to read multi-script documents. To make a multilingual OCR system successful, it is necessary to identify the different script regions of the document before feeding the document to the OCR of each individual language. In this context, this paper addresses the prioritized requirements of a particular region, Andhra Pradesh, a state in India, where any document, including official ones, may contain text in three languages: Telugu, Hindi and English. The objective of this paper is therefore to develop a system that accurately identifies and separates Telugu, Hindi and English text lines from a printed multilingual document, and groups portions of the document in other languages into a separate category, OTHERS. The proposed method is developed by thoroughly understanding the nature of the top and bottom profiles of printed text lines. Experimentation involved 900 text lines for learning and 900 text lines for testing; the performance turned out to be 95.67%.
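The top and bottom profiles mentioned above can be illustrated with a minimal sketch: for each column of a binarized text-line image, record where ink first appears from the top and from the bottom. This is a generic per-column scan, not the paper's implementation.

```python
# Sketch: top and bottom profiles of a binarized text-line image
# (1 = ink, 0 = background). Blank columns yield None.

def profiles(line):
    rows, cols = len(line), len(line[0])
    top, bottom = [], []
    for c in range(cols):
        # row indices of ink pixels in this column
        ink = [r for r in range(rows) if line[r][c]]
        top.append(ink[0] if ink else None)
        bottom.append(ink[-1] if ink else None)
    return top, bottom

img = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
]
print(profiles(img))  # ([1, 0, None], [1, 2, None])
```

Statistics over these two profile curves (e.g. their shape near the headline and baseline) are the kind of per-line features a script identifier can be built on.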
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA (ijnlc)
Due to the dramatic growth of internet use, the amount of unstructured Bengali text data has increased
enormously. It is therefore essential to extract events from it intelligently. Progress in natural language
processing (NLP) technologies for information extraction makes it possible to locate and classify content in
news data according to predefined categories such as person name, place name, organization name, date,
time etc. Named entity recognition (NER), a subtask of NLP, plays a vital role in achieving human-level
performance on specific documents such as newspapers by effectively identifying entities. The purpose of
this research is to introduce an NER system for Bengali news data that identifies events of specified things
in running text based on regular expressions and Bengali grammar. To this end, I have designed and
evaluated part-of-speech (POS) tags to recognize proper nouns. In this thesis, I explain a Hidden
Markov Model (HMM) based approach for developing an NER system from Bengali news data.
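The rule-based side of such a pipeline, extracting date- and time-like entities with regular expressions, might look like the following sketch. The patterns are illustrative digit-based stand-ins, not the Bengali-specific rules used in the thesis.

```python
# Sketch: regular-expression extraction of date/time-like entities,
# the rule-based component of a hybrid NER pipeline.
import re

PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
}

def extract(text):
    # Return (label, matched text) for every pattern hit.
    return [(label, m.group()) for label, p in PATTERNS.items()
            for m in p.finditer(text)]

print(extract("Meeting on 12/03/2021 at 10:30"))
```

In a full system, entities the rules miss are handed to the statistical (e.g. HMM) component.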
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ... (Syeful Islam)
Hundreds of millions of people of almost all levels of education and attitudes, from different countries, communicate with each other for different purposes using various languages. Machine translation is in high demand due to the increasing use of web-based communication. One of the major problems of Bengali translation is identifying a naming word in a sentence, which is relatively simple in English because such entities start with a capital letter. Bangla has no concept of small or capital letters, and a huge number of different naming entities exist in Bangla, so it is difficult to determine whether a word is a naming word or not. Here we introduce a new approach to identify naming words in a Bengali sentence for a machine translation system without storing a huge number of naming entities in the word dictionary. The goal is to make Bangla sentence conversion possible with minimal words stored in the dictionary.
Development and evaluation of a web based question answering system for arabi... (ijnlc)
Question Answering (QA) systems are gaining great importance due to the increasing amount of web
content and the high demand for digital information that regular information retrieval techniques cannot
satisfy. A question answering system enables users to have a natural language dialog with the machine,
which is required for virtually all emerging online service systems on the Internet. The need for such
systems is higher in the context of the Arabic language. This is because of the scarcity of Arabic QA
systems, which can be attributed to the great challenges they present to the research community, including
the particularities of Arabic, such as short vowels, the absence of capital letters, complex morphology, etc. In
this paper, we report the design and implementation of an Arabic web-based question answering
system, which we call "JAWEB", the Arabic word for the verb "answer". Unlike other Arabic question-answering
systems, JAWEB is a web-based application, so it can be accessed at any time and from
anywhere. Evaluating JAWEB showed that it gives the correct answer with 100% recall and 80% precision
on average. When compared to ask.com, the well-established web-based QA system, JAWEB provided
15-20% higher recall. These promising results give clear evidence that JAWEB has great potential as a QA
platform and is much needed by Arabic-speaking Internet users across the world.
Hidden markov model based part of speech tagger for sinhala language (ijnlc)
In this paper we present fundamental lexical semantics of the Sinhala language and a Hidden Markov Model (HMM) based Part of Speech (POS) tagger for Sinhala. In any natural language processing task, part of speech is a vital topic that involves analysing the construction, behaviour and dynamics of the language; this knowledge can be utilized in computational linguistics analysis and automation applications. Although Sinhala is a morphologically rich and agglutinative language, in which words are inflected with various grammatical features, tagging is essential for further analysis of the language. Our research follows a statistical approach, in which tagging is done by computing the tag-sequence probability and the word-likelihood probability from the given corpus, with the linguistic knowledge extracted automatically from the annotated corpus. The current tagger reaches more than 90% accuracy for known words.
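The two quantities the HMM tagger combines (tag-sequence probability and word-likelihood probability) drive a standard Viterbi search, sketched below with toy numbers. The tags and probabilities are illustrative, not drawn from the Sinhala corpus.

```python
# Sketch: Viterbi decoding for an HMM POS tagger.
# trans[p][t]  = P(tag t | previous tag p)   (tag-sequence probability)
# emit[t][w]   = P(word w | tag t)           (word-likelihood probability)

def viterbi(words, tags, trans, emit, start):
    # v[t] = probability of the best tag sequence ending in tag t
    v = {t: start.get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}
    path = {t: [t] for t in tags}
    for w in words[1:]:
        v2, path2 = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: v[p] * trans[p].get(t, 0.0))
            v2[t] = v[best_prev] * trans[best_prev].get(t, 0.0) * emit[t].get(w, 0.0)
            path2[t] = path[best_prev] + [t]
        v, path = v2, path2
    return path[max(tags, key=lambda t: v[t])]

tags = ["N", "V"]
start = {"N": 0.8, "V": 0.2}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dogs": 0.6, "run": 0.1}, "V": {"dogs": 0.1, "run": 0.7}}
print(viterbi(["dogs", "run"], tags, trans, emit, start))  # ['N', 'V']
```

Real taggers work in log space and smooth the estimates for unknown words; this sketch omits both for brevity.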
WRITER RECOGNITION FOR SOUTH INDIAN LANGUAGES USING STATISTICAL FEATURE EXTRA... (ijnlc)
The comprehension of complete handwritten documents is a challenging problem that incorporates various
difficult tasks. Given a handwritten document, its layout first needs to be analysed to isolate the different
content types. These content types can then be routed to specific subsystems, including writer-style,
image, or table recognizers. Research in automatic writer identification has principally centred on the
statistical approach. This has led to the selection and extraction of statistical features such as run-length
distributions, slant distribution, entropy, and the edge-hinge distribution. The edge-hinge distribution
outperforms all other statistical features. It is a feature that characterizes the changes in direction of a
writing stroke in handwritten text. The edge-hinge distribution is extracted by means of a window that is
slid over an edge-detected version of the offline scanned images. Whenever the central pixel of the window
is on, the two edge fragments (i.e. connected sequences of pixels) emerging from this central pixel are
considered. Their directions are measured and stored as pairs. A joint probability distribution is obtained from an extensive recognition
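The joint distribution behind the edge-hinge ("edge-pivot") feature described above can be sketched as follows: quantize the two stroke directions measured at each hinge point into angle bins and count co-occurring pairs into a normalized joint histogram. This is a minimal illustration, not the paper's feature extractor.

```python
# Sketch: joint histogram over pairs of stroke directions (edge-hinge
# style). direction_pairs holds (angle1, angle2) in degrees for the two
# edge fragments emerging from each 'on' central pixel.
from collections import Counter

def edge_hinge_histogram(direction_pairs, bins=8):
    counts = Counter()
    for a1, a2 in direction_pairs:
        # quantize each angle into one of `bins` equal sectors
        counts[(int(a1) * bins // 360, int(a2) * bins // 360)] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

pairs = [(0, 90), (10, 100), (180, 270)]
print(edge_hinge_histogram(pairs))
```

The resulting probability vector, computed over a whole document, serves as a writer signature that can be compared between documents.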
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION (kevig)
Phonetic typing using the English alphabet has become widely popular for social media and chat services. As a result, texts containing various English and Bangla words and phrases have become increasingly common. Existing transliteration tools display poor performance on such texts. This paper proposes a robust Three-stage Hybrid Transliteration (THT) framework that can transliterate both English words and phonetically typed Bangla words satisfactorily. This is achieved by adopting a hybrid approach of dictionary-based and rule-based techniques. Experimental results confirm the superiority of THT, as it significantly outperforms the benchmark transliteration tool.
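The dictionary-then-rules flavour of hybrid transliteration described above can be sketched as follows: known words are resolved by dictionary lookup, and everything else falls back to character-level rules. The mappings are toy placeholders, not THT's actual tables.

```python
# Sketch: hybrid transliteration -- dictionary lookup first, then a
# per-character rule table as fallback (toy mappings only).

DICTIONARY = {"dhaka": "ঢাকা"}          # known-word lookups win
RULES = {"a": "া", "k": "ক", "b": "ব"}   # character-level fallback rules

def transliterate(word):
    if word in DICTIONARY:
        return DICTIONARY[word]
    # unknown word: map each character by rule, pass through otherwise
    return "".join(RULES.get(ch, ch) for ch in word)

print(transliterate("dhaka"))  # dictionary hit
print(transliterate("kab"))    # rule-based fallback
```

A real system would use context-sensitive rules (digraphs, vowel signs) rather than a one-to-one character map.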
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT... (kevig)
This study proposes a method to develop neural morphological analyzers for Japanese Hiragana sentences using the Bi-LSTM CRF model. Morphological analysis is a technique that divides text data into words and assigns information such as parts of speech. In Japanese natural language processing systems, this technique plays an essential role in downstream applications because Japanese does not have delimiters between words. Hiragana is a type of Japanese phonogramic character used in texts for children or for people who cannot read Chinese characters. Morphological analysis of Hiragana sentences is more difficult than that of ordinary Japanese sentences because there is less information available for word division. For the morphological analysis of Hiragana sentences, we demonstrate the effectiveness of fine-tuning using a model based on ordinary Japanese text and examine the influence of training data on texts of various genres.
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC... (kevig)
In this paper, phoneme sequences are used as language information to perform code-switched language
identification (LID). With a one-pass recognition system, the spoken sounds are converted into
phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple
languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity
among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based
bigram language models (LMs) are integrated into speech decoding to eliminate possible phone
mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic
information of mixed-language speech based on recognized phone sequences. As the back-end decision is
taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to
classify language identity. The speech corpus was tested on Sepedi and English, languages that are often
mixed. Our system is evaluated by measuring the ASR performance and the LID performance
separately. The systems obtained promising ASR accuracy with a data-driven phone-merging
approach modelled using 16 Gaussian mixtures per state. The proposed systems achieved acceptable ASR
and LID accuracy in code-switched speech and monolingual speech segments respectively.
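The back-end idea above, scoring a recognized phone sequence against per-language phone bigram models, can be sketched as follows. The probabilities are toy values; the paper additionally feeds such scores into an SVM rather than taking the arg-max directly.

```python
# Sketch: language identification by phone-bigram log-likelihood.
import math

def bigram_score(phones, lm, floor=1e-6):
    # sum of log probabilities over adjacent phone pairs;
    # unseen bigrams get a small floor probability
    return sum(math.log(lm.get((a, b), floor))
               for a, b in zip(phones, phones[1:]))

LMS = {
    "english": {("th", "e"): 0.4, ("e", "n"): 0.3},
    "sepedi": {("le", "fa"): 0.5, ("fa", "se"): 0.4},
}

def identify(phones):
    return max(LMS, key=lambda lang: bigram_score(phones, LMS[lang]))

print(identify(["th", "e", "n"]))   # english
print(identify(["le", "fa", "se"])) # sepedi
```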
Transliteration by orthography or phonology for hindi and marathi to english ... (ijnlc)
e-Governance and web-based online commercial multilingual applications have given utmost importance to
the tasks of translation and transliteration. The named entities and technical terms that occur in the source
language of translation are called out-of-vocabulary words, as they are not available in the multilingual
corpus or dictionary used to support the translation process. These named entities and technical terms need
to be transliterated from the source language to the target language without losing their phonetic properties. The
fundamental problem in India is that there is no set of rules for writing English spellings of
Indian-language words according to the linguistics: people write different spellings for the same name at
different places. This fact certainly affects the Top-1 accuracy of transliteration and, in turn, the
translation process. The major issue we noticed is the transliteration of named entities consisting of three
syllables or three phonetic units in Hindi and Marathi, where people use a mixed approach to
write the spelling, either orthographical or phonological. In this paper the authors provide,
through experimentation, their opinion about the appropriateness of either approach.
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ... (kevig)
This study extracted and analyzed the linguistic speech patterns that characterize Japanese anime or game characters. Conventional morphological analyzers, such as MeCab, segment words with high performance, but they are unable to segment broken expressions or utterance endings that are not listed in the dictionary, which often appear in the lines of anime or game characters. To overcome this challenge, we propose segmenting the lines of Japanese anime or game characters using subword units, which were proposed mainly for deep learning, and extracting frequently occurring strings to obtain expressions that characterize their utterances. We analyzed the subword units weighted by TF/IDF according to gender, age, and each anime character, and show that they are linguistic speech patterns specific to each feature. Additionally, a classification experiment shows that the model with subword units outperformed the conventional method.
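The TF/IDF weighting of subword units per character described above can be sketched as follows: treat each character's concatenated lines as one "document" of subword units, and weight units that are frequent for that character but rare across characters. The segmented data here is toy, not real anime lines.

```python
# Sketch: TF-IDF over subword units, one "document" per character.
import math
from collections import Counter

def tfidf(docs):
    n = len(docs)
    # document frequency: in how many characters' lines a unit occurs
    df = Counter(u for units in docs.values() for u in set(units))
    return {
        name: {u: c / len(units) * math.log(n / df[u])
               for u, c in Counter(units).items()}
        for name, units in docs.items()
    }

docs = {
    "hero":  ["da", "yo", "ne", "da"],
    "rival": ["de", "aru", "ne"],
}
weights = tfidf(docs)
print(max(weights["hero"], key=weights["hero"].get))  # 'da'
```

Units shared by all characters (like "ne" here) get zero weight, so the surviving high-weight units approximate each character's distinctive speech pattern.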
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB... (ijnlc)
In this paper we present our work in studying the sublanguage of Arabic SMS-based classified ads,
from the developer's point of view. We use the corpus collected from an operational system, CATS,
compare the SMS-based and the Web-based messages, and discuss some quantitative properties of the
studied text.
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts (ijnlc)
In this paper a system for recognizing Arabic ancient manuscripts is presented. The system has been
divided into four parts. The first part is the image pre-processing where the text in the Arabic ancient
manuscript will be recognized as a collection of Arabic characters through three phases of processing. The
second part is the Arabic text analysis which consists of lexical analyzer; syntax analyzer; and semantic
analyzer. The output of this subsystem is an XML file format that represents the ancient manuscript text.
The third part is the intermediate text generation, in this part an intermediate presentation of the Arabic
text is generated from the XML text file. The fourth part of the system is the Arabic text generation, which
converts the generated text to a modern standard Arabic (MSA) language (this part has four phases: text
organizer; pre-optimizer; semantics generator; and post-optimizer).
A SIGNATURE BASED DRAVIDIAN SIGN LANGUAGE RECOGNITION BY SPARSE REPRESENTATION (ijnlc)
Sign language is a visual-gestural language used by deaf-dumb people for communication. As normal people are unfamiliar with sign language, hearing-impaired people find it difficult to communicate with them. The communication gap between normal and deaf-dumb people can be bridged by means of Human-Computer Interaction. The objective of this paper is to convert the Dravidian (Tamil) sign language into text. The proposed method recognizes 12 vowels, 18 consonants and a special character "Aytham" of the Tamil language by a vision-based approach. In this work, the static images of the hand signs are obtained from a web/digital camera. The hand region is segmented by a threshold applied to the hue channel of the input image. Then the region of interest (i.e. from wrist to fingers) is segmented using the reversed horizontal projection profile, and the Discrete Cosine transformed signature is extracted from the boundary of the hand sign. These features are invariant to translation, scale and rotation. A sparse representation classifier is incorporated to recognize the 31 hand signs. The proposed method has attained a maximum recognition accuracy of 71% in a uniform background.
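A DCT "signature" of a closed boundary, of the kind described above, can be sketched by reducing the boundary to a centroid-distance sequence and keeping its low-frequency DCT-II coefficients. This is a minimal illustration on a toy boundary (no image I/O, and the paper's exact signature definition may differ).

```python
# Sketch: DCT signature of a shape boundary. Centering on the centroid
# makes the distance sequence translation-invariant; keeping only
# magnitudes of low-frequency coefficients is one route to further
# scale/rotation robustness.
import math

def dct_signature(boundary, n_coeffs=4):
    cx = sum(x for x, _ in boundary) / len(boundary)
    cy = sum(y for _, y in boundary) / len(boundary)
    # distance of each boundary point from the centroid
    dist = [math.hypot(x - cx, y - cy) for x, y in boundary]
    n = len(dist)
    # DCT-II of the distance sequence, first n_coeffs terms
    return [sum(d * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, d in enumerate(dist))
            for k in range(n_coeffs)]

square = [(0, 0), (0, 2), (2, 2), (2, 0)]
sig = dct_signature(square)
print(sig)
```

For the symmetric toy square, all boundary points are equidistant from the centroid, so only the DC coefficient is non-zero.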
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES (ijnlc)
Stemming is the process of term conflation: it conflates all word variants to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications such as morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and existing stemmers for Indic languages.
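The simplest family of stemmers such a survey covers is longest-match suffix stripping, sketched below. English suffixes are used for illustration; Indic stemmers use language-specific suffix tables, and real rule-based stemmers add recoding and minimum-stem constraints beyond this toy version.

```python
# Sketch: longest-match suffix-stripping stemmer with a minimum
# stem length to avoid over-stemming short words.

SUFFIXES = ["ations", "ation", "ing", "ed", "es", "s"]  # longest first

def stem(word, min_stem=3):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ["conflates", "playing", "nations", "cat"]])
```

Note how the `min_stem` guard rejects stripping "ations" from "nations" (stem would be one letter) and falls through to the shorter "s" rule instead.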
Arabic morphology encapsulates many valuable features such as a word's root. Arabic roots are being utilized for many tasks; the process of extracting a word's root is referred to as stemming. Stemming is an essential part of most Natural Language Processing tasks, especially for derivative languages such as Arabic. However, stemming faces the problem of ambiguity, where two or more roots could be extracted from the same word. On the other hand, distributional semantics is a powerful co-occurrence
model: it captures the meaning of a word based on its context. In this paper, a distributional semantics
model utilizing Smoothed Pointwise Mutual Information (SPMI) is constructed to investigate its effectiveness on the stemming analysis task. It showed an accuracy of 81.5%, with at least a 9.4% improvement over other stemmers.
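The association measure behind such a model, pointwise mutual information with smoothing, can be sketched as below. The exact smoothing scheme here (simple add-k on the joint count) is an assumption for illustration; the paper's formulation may differ.

```python
# Sketch: smoothed PMI between a word and a context word from
# co-occurrence counts. Add-k smoothing keeps unseen pairs finite.
import math

def spmi(pair_count, w_count, c_count, total, k=0.5):
    p_wc = (pair_count + k) / (total + k)  # smoothed joint probability
    p_w = w_count / total
    p_c = c_count / total
    return math.log2(p_wc / (p_w * p_c))

# word seen 10 times, context 20 times, together 8 times, in 100 events
print(spmi(pair_count=8, w_count=10, c_count=20, total=100))
```

A high SPMI between a candidate root's contexts and the ambiguous word's context is then evidence for choosing that root.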
MACHINE TRANSLATION DEVELOPMENT FOR INDIAN LANGUAGES AND ITS APPROA... (ijnlc)
This paper presents a survey of Machine translation systems for Indian regional languages. Machine translation is one of the central areas of Natural language processing (NLP). Machine translation (henceforth referred to as MT) is important for breaking the language barrier and facilitating inter-lingual communication. For a multilingual country like India, the largest democratic country in the world, there is a big requirement for automatic machine translation systems. With the advent of Information Technology, many documents and web pages are coming up in local languages, so there is a large need for good MT systems to address all these issues in order to establish proper communication between state and union governments and to exchange information amongst the people of different states. This paper focuses on different Machine translation projects done in India along with their features and domains.
TURN SEGMENTATION INTO UTTERANCES FOR ARABIC SPONTANEOUS DIALOGUES ... (ijnlc)
Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic Human-Computer systems. In this paper, we introduce a novel approach to segmenting turns into utterances for Egyptian spontaneous dialogues and Instant Messages (IM) using a Machine Learning (ML) approach, as part of the task of automatically understanding Egyptian spontaneous dialogues and IM. Due to the lack of an Egyptian dialect dialogue corpus, the system is evaluated on our own corpus, which includes 3001 turns collected, segmented, and annotated manually from Egyptian call-centers. The system achieves an F1 score of 90.74% and an accuracy of 95.98%.
Text mining is a new and exciting research area that tries to solve the information overload problem using techniques from machine learning, natural language processing (NLP), data mining, information retrieval (IR), and knowledge management. Text mining involves the pre-processing of document collections (information extraction, term extraction, text categorization) and the storage of intermediate representations, together with techniques to analyse those intermediate representations, such as clustering, distribution analysis, association rules and visualisation of the results.
Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (ijnlc)
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document will be formed by the joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.
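The contextual analysis described above, choosing a positional glyph form (isolated, initial, medial, or final) for each character from joining rules, can be sketched as follows. The rule used here, that a letter connects forward only if it is "dual-joining", is a deliberate simplification of real Arabic-script joining behaviour, and the letter sets are toy placeholders.

```python
# Sketch: assigning positional forms from simplified joining rules.

DUAL_JOINING = {"b", "s", "m"}   # toy letters that join on both sides
RIGHT_JOINING = {"d", "r"}       # toy letters joining only backwards

def positional_forms(word):
    forms = []
    for i, ch in enumerate(word):
        # does the previous letter connect forward into this one?
        joins_prev = i > 0 and word[i - 1] in DUAL_JOINING
        # does this letter connect forward into a following one?
        joins_next = ch in DUAL_JOINING and i < len(word) - 1
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_next:
            forms.append("initial")
        elif joins_prev:
            forms.append("final")
        else:
            forms.append("isolated")
    return forms

print(positional_forms("bsd"))  # ['initial', 'medial', 'final']
```

A rendering engine would then select the glyph variant matching each computed form.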
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names, and miscellaneous (date and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval etc. This paper reports the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for Telugu is very new. The system makes use of the contextual information of words along with a variety of features that are helpful in predicting the four named entity (NE) classes: person name, location name, organization name, and miscellaneous (date and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
This paper presents a rule-based model of a parts-of-speech (POS) tagset for Classical Tamil Texts (CTT). The noun forms follow a type pattern, while the verb forms follow a token pattern, based on the form-agreement method. This is an efficient and novel approach because the Tamil language has a built-in system of agreement/concord in the sentence. The Classical Tamil tagset is divided into two basic classifications: noun morphology and verb morphology.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (kevig)
Morphological stemming is a critical step in natural language processing. The process of
stemming reduces alternative forms to a common morphological root. Word segmentation for the
Myanmar language, as for most Asian languages, is an important task and an extensively studied
sequence-labelling problem. Named entity detection is one of the issues in Asian languages that has
traditionally required a large amount of feature engineering to achieve high performance. Our new
approach integrates these tasks so that all of them benefit. In recent years, end-to-end
sequence labelling models with deep learning have become widely used. This paper introduces a deep BiGRU-CNN-CRF network that jointly learns the word segmentation, stemming and named entity recognition tasks.
We trained the model using manually annotated corpora. State-of-the-art named entity recognition
systems rely heavily on handcrafted features; in our new approach, we introduce a joint model that
relies on two sources of information: character-level representation and syllable-level representation.
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES (ijistjournal)
Named Entity Recognition is a preliminary task in Natural Language Processing. As a subtask of information extraction, it identifies and classifies proper nouns into predefined categories such as person, location, organization, time, date etc. This document focuses on NER approaches and discusses the work done so far for various languages to identify named entities. The authors have carried out a comparative study of named entity recognition and found that the CRF approach has proven best for identifying named entities in Indian languages.
Domain Specific Named Entity Recognition Using Supervised Approach (Waqas Tariq)
This paper introduces a Named Entity Recognition approach for textual corpora, developed with supervised statistical methods. Our system can be used to categorize NEs belonging to a particular domain for which it is trained. As named entities appear in text surrounded by contexts (words to the left or right of the NE), we focus on extracting NE contexts from text and then performing statistical computations on them, using n-gram modeling to extract contexts. Our methodology first extracts the left and right tri-grams surrounding NE instances in the training corpus and calculates their probabilities; all extracted tri-grams, along with their calculated probabilities, are stored in a file. During testing, the system detects unrecognized NEs in the testing corpus and categorizes them using the tri-gram probabilities calculated during training. The proposed system consists of two modules, Knowledge Acquisition and NE Recognition: the Knowledge Acquisition module extracts the tri-grams surrounding NEs in the training corpus, and the NE Recognition module categorizes the named entities in the testing corpus.
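The knowledge-acquisition step described above, collecting left and right tri-gram contexts around known named entities and estimating their relative frequencies, can be sketched as follows. The corpus is toy data, and smoothing and thresholds are omitted.

```python
# Sketch: extracting left/right tri-gram contexts around NE positions
# and converting counts to relative frequencies.
from collections import Counter

def learn_contexts(tokens, ne_positions):
    left, right = Counter(), Counter()
    for i in ne_positions:
        left[tuple(tokens[max(0, i - 3):i])] += 1   # 3 words before the NE
        right[tuple(tokens[i + 1:i + 4])] += 1      # 3 words after the NE
    total = len(ne_positions)
    return ({k: v / total for k, v in left.items()},
            {k: v / total for k, v in right.items()})

tokens = ["the", "city", "of", "Pune", "is", "growing", "fast"]
left, right = learn_contexts(tokens, [3])  # "Pune" is the known NE
print(left)   # {('the', 'city', 'of'): 1.0}
print(right)  # {('is', 'growing', 'fast'): 1.0}
```

At test time, a candidate token whose surrounding tri-grams score highly under these tables would be accepted as an entity of the trained domain.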
A study on the approaches of developing a named entity recognition tool (eSAT Publishing House)
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Automatic classification of bengali sentences based on sense definitions pres... (ijctcm)
Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to automatically
classify Bengali sentences into different groups in accordance with their underlying senses. The input
sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL
project of the Govt. of India, while information about the different senses of a particular ambiguous lexical
item is collected from the Bengali WordNet. On an experimental basis we have used the Naive Bayes probabilistic
model as a classifier of sentences. We applied the algorithm to 1747 sentences that contain a
particular Bengali lexical item which, because of its ambiguous nature, is able to trigger different senses
that render the sentences with different meanings. In our experiment we achieved around 84% accuracy
in sense classification over the total input sentences. We analyzed the residual sentences
that did not comply with our experiment and affected the results, noting that in many cases wrong
syntactic structures and sparse semantic information are the main hurdles in the semantic classification of
sentences. The applicational relevance of this study is attested in automatic text classification, machine
learning, information extraction, and word sense disambiguation.
MODIFIED PAGE RANK ALGORITHM TO SOLVE AMBIGUITY OF POLYSEMOUS WORDSIJCI JOURNAL
Ambiguity in natural language has an effect on Information Retrieval (IR) system in general and web
search system in particular. Word sense ambiguity is the reason behind the poor performance of IR system.
If the ambiguous words are correctly disambiguated then IR performance can increase. WSD is defined as
the task of identifying the sense of word in textual context. Word Sense Disambiguation (WSD) is the
central problem of language processing. WSD improves the IR in many ways. Current IR system do not use
explicit WSD and rely on user typing enough content in the query to only retrieve documents relevant to the
intended senses. This paper proposes an algorithm for efficient retrieval of information on the Web
according to user’s need. Results are more accurate and they are according to user’s preference.
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari ...Nischal Lal Shrestha
This study investigates the feasibility of identifying gender via personal names written in the Devanagari
script. While extensive research exists for languages like English, Russian, and Brazilian, little attention
has been given to low-resource language like Nepali. This study addresses this gap, aiming to enhance
gender classification in Nepali names. We compare traditional machine learning algorithms along
with advanced contextualized transformer architectures like variants of BERT (Bidirectional Encoder
Representations from Transformers) specifically fine-tuned for detecting gender among individuals with
Nepali names. Our experiments aim at exploring how effectively these techniques capture linguistic
nuances unique to the Nepali language and assess their overall performance in terms of accuracy,
precision, recall, and F1 score metrics. Both experiments reveal superior performance by the BERT
variants DeBERTa and DistilBERT, demonstrating the merits of advanced NLP techniques for gender
classification in Nepali. The study also show that the additional encoding of name endings (last
character of a name) improve the performance of traditional ML algorithms by 10%. Through this
work, we contribute to gender classification methodologies tailored specifically for the linguistic nuances
of the Nepali language.
A survey of named entity recognition in assamese and other indian languagesijnlc
Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is a process of identification of proper nouns in the text and classification of those nouns into certain predefined categories like person name, location,organization, date, and time etc. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC, based on Multinomial Naïve Bayes (MNB) Classifier. The Methodology adopted in this paper is based on feature extraction of training corpus, by using term frequency, inverse document frequency and fitting them to a tf-idf-vectorizer. The paper discusses the
various issues in developing the proposed model. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of size 95,170 tokens and test corpus of 5,000 tokens. It is observed that the model works with Precision, Recall and F1-measure of
83%, 79% and 81% respectively.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades speech interactive systems have gained increasing importance. Performance of an ASR system mainly depends on the availability of large corpus of speech. The conventional method of building a large vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires large speech corpus with sentence or phoneme level transcription of the speech utterances. The transcriptions must also include different speech order so that the recognizer can build models for all the sounds present. But, for Telugu language, because of its complex nature, a very large, well annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands and millions of word forms. A significant part of grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases including several words (that is, tokens) in English would be mapped on to a single word in Telugu.Telugu language is phonetic in nature in addition to rich in morphology. That is why the speech technology developed for English cannot be applied to Telugu language. This paper highlights the work carried out in an attempt to build a voice enabled text editor with capability of automatic term suggestion. Main claim of the paper is the recognition enhancement process developed by us for suitability of highly inflecting, rich morphological languages. This method results in increased speech recognition accuracy with very much reduction in corpus size. It also adapts Telugu words to the database dynamically, resulting in growth of the corpus.
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are one of the most used techniques in the field of computer applications. It has become one of the vast and advanced techniques. Language is the means of communication or interaction among humans and in present scenario when everything is dependent on machine or everything is computerized, communication between computer and human has become a necessity. To fulfill this necessity NLP has been emerged as the means of interaction which narrows the gap between machines (computers) and humans. It was evolved from the study of linguistics which was passed through the Turing test to check the similarity between data but it was limited to small set of data. Later on various algorithms were developed along with the concept of AI (Artificial Intelligence) for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP which have been developed till now, their applications and the comparison of all those techniques on different parameters.
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
call for paper 2012, hard copy of journal, research paper publishing, where to publish research paper,
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Welocme to ViralQR, your best QR code generator.ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since its inception, we have successfully served many clients by offering QR codes in their marketing, service delivery, and collection of feedback across various industries. Our platform has been recognized for its ease of use and amazing features, which helped a business to make QR codes.
Our Services
At ViralQR, here is a comprehensive suite of services that caters to your very needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, there is a 14-day free offer to ViralQR, which is an exceptional opportunity for new users to take a feel of this platform. One can easily subscribe from there and experience the full dynamic of using QR codes. The subscription plans are not only meant for business; they are priced very flexibly so that literally every business could afford to benefit from our service.
Why choose us?
ViralQR will provide services for marketing, advertising, catering, retail, and the like. The QR codes can be posted on fliers, packaging, merchandise, and banners, as well as to substitute for cash and cards in a restaurant or coffee shop. With QR codes integrated into your business, improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers of ViralQR receive detailed analytics and tracking tools in light of having a view of the core values of QR code performance. Our analytics dashboard shows aggregate views and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
So, thank you for choosing ViralQR; we have an offer of nothing but the best in terms of QR code services to meet business diversity!
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
International Journal on Natural Language Computing (IJNLC) Vol. 5, No.1, February 2016
DOI: 10.5121/ijnlc.2016.5102
NITA PATIL, AJAY S. PATIL AND B. V. PAWAR
School of Computer Sciences, North Maharashtra University, Jalgaon
ABSTRACT
Information Extraction (IE) is a sub-discipline of Artificial Intelligence. IE identifies information in
unstructured sources that adheres to predefined semantics, such as people, locations, etc. Recognition
of named entities (NEs) from computer-readable natural language text is a significant task of IE and natural
language processing (NLP). Named entity (NE) extraction is an important step in processing unstructured
content. Unstructured data is computationally opaque, and computers require computationally transparent
data for processing; IE adds meaning to raw data so that computers can process it easily. Various
approaches are applied for the extraction of entities from text. This paper elaborates the need for NE
recognition for Marathi and discusses the issues and challenges involved in NE recognition tasks for the
Marathi language. It also explores methods and techniques that are useful for the creation of learning
resources and lexicons, which are important for the extraction of NEs from unstructured natural language
text.
KEYWORDS
Named Entity Recognition, Information Extraction, Issues, Challenges, Marathi, Techniques & Resources
1. INTRODUCTION
Named Entity Recognition (NER) is one of the important sub-tasks of Information Extraction
(IE). Named Entities (NEs) are noun phrases in natural language text. Natural language text is
a sequence of sentences, where a sentence in turn is a sequence of words and punctuation
combined to add semantics to the text; a word, in turn, is a sequence of characters. The aim of
NER is to distinguish character sequences that represent noun phrases from character sequences
that represent normal text. A proper noun in the text that refers to a person, company, location,
etc. is called a Named Entity (NE). The main types of NEs are shown in Fig. 1.
Figure 1. Types of Named Entities
Proper nouns play a vital role in the discovery of the semantics hidden in text. Identification of proper
nouns in raw text and classification of the identified proper nouns into appropriate categories is an
important sub-task of information extraction called Named Entity Recognition (NER). Any
entity that has a name, such as a person, organization, geographical location, government agency or
event, is treated as a named entity. Expressions such as money, percentages, phone numbers, dates,
times, URLs and addresses are also treated as named entities even though they are not exactly like
traditional proper nouns [1]. Marking NEs in natural language text is a significant pre-processing
step for NLP applications such as business intelligence, summarization, machine translation
and question answering systems. NER is a significantly difficult task; the example given below
depicts many of the complexities involved.
[PER: Washington’s] daughter and [ORG: GM] chief [PER: Mary Barra] apologized for the
[LOC: US] automaker's failure in [LOC: Washington] Metropolitan Area in [LOC: United
States].
The PER, ORG and LOC labels in the sentence above represent Person, Organization and Location
entities respectively. This sentence contains six instances of named entities that illustrate many
complexities of the NER problem. The first instance of the name Washington refers to a person,
whereas the second instance is a location name, so the name is ambiguous. GM is an acronym for
General Motors, a well-known car manufacturer in the US. Mary Barra is a two-word chunk that
denotes a single person. Further, US and United States are two different chunks that should be
identified with the same label, LOC.
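The multi-word chunks in the example above are what make NER a sequence-labeling rather than a word-labeling problem. As a minimal sketch (not taken from the paper), the sentence can be encoded in the widely used BIO scheme, where B- opens an entity chunk, I- continues it, and O marks ordinary text; a small helper then recovers whole chunks such as "Mary Barra":

```python
# The example sentence encoded in the BIO tagging scheme. B- marks the
# beginning of an entity chunk, I- its continuation, O non-entity text.
tagged = [
    ("Washington's", "B-PER"), ("daughter", "O"), ("and", "O"),
    ("GM", "B-ORG"), ("chief", "O"),
    ("Mary", "B-PER"), ("Barra", "I-PER"),
    ("apologized", "O"), ("for", "O"), ("the", "O"),
    ("US", "B-LOC"), ("automaker's", "O"), ("failure", "O"), ("in", "O"),
    ("Washington", "B-LOC"), ("Metropolitan", "O"), ("Area", "O"), ("in", "O"),
    ("United", "B-LOC"), ("States", "I-LOC"),
]

def chunks(pairs):
    """Collect (entity_text, label) chunks from a BIO-tagged token list."""
    out, current, label = [], [], None
    for tok, tag in pairs:
        if tag.startswith("B-"):
            if current:
                out.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                out.append((" ".join(current), label))
            current, label = [], None
    if current:
        out.append((" ".join(current), label))
    return out
```

Running `chunks(tagged)` yields the six entities of the example, with "Mary Barra" and "United States" correctly grouped as single chunks.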
Indian languages have originated from Sanskrit, whereas the Indic scripts have originated from
Brahmi. In ancient India, Sanskrit was spoken by Brahmins while Prakrit was the language used by
common people; Marathi originated from Prakrit. Marathi is the official language of the state of
Maharashtra in India and is recognized as one of the twenty-two official languages of the country
by the Constitution of India. Only a limited portion of the Indian population can read or write
English, so over 90 percent of the population remains far from the benefits of English-based
information technology. As international boundaries become narrower day by day, language should
not be a barrier for common people seeking information, and the importance of natural language
processing is growing dramatically. A major enhancement in information technology is widely
agreed to involve semantic search. As for other Indian languages (Bengali, Hindi, Telugu), NLP
tool development is necessary to enrich Marathi language technology.
2. RELATED WORK
Research in NER has been surveyed by various researchers from time to time and from various
aspects. Significant work in NER began after 1987. A survey of the progress in NER from 1991 to
2006 can be found in David et al. [2]. Their study reported various NE types, domains and
genres, various learning methods for NER, and performance evaluation techniques in detail. Their
work also explored word-level, dictionary-level and corpus-based features of natural language
text. Khaled Shaalan [3] describes the increase in interest and progress made in Arabic NER
research, the importance of the NER task, the main characteristics of the Arabic language,
aspects of standardization in annotating named entities, linguistic resources, approaches, features
of common tools, and the standard evaluation metrics used in Arabic NER. Sasidhar et al. [4] studied
named entity recognition and classification (NERC) and presented various NERC system
development approaches and NERC system evaluation techniques with respect to Indian
languages; they highlighted that NER research is difficult for Indian languages. Padmaja et
al. [5] presented an overview of NER, issues in the Indian languages, the different approaches
used in NER, and research on NER in different Indian languages such as Bengali, Telugu, Hindi,
Oriya and Urdu, along with the methodologies used. The objectives of the survey presented in
this paper are to assimilate resources useful for the development of a NERC system for the Indian
language Marathi, to explore the challenges in named entity extraction from Marathi text, to study
the techniques used by other researchers, and to investigate the steps involved in developing the
resources required to pinpoint a suitable technique for a Marathi NERC system.
3. ISSUES AND CHALLENGES IN MARATHI NER
Implementation of a Marathi NER system is challenging. NER is a difficult task in general, and it
becomes considerably more difficult for the Marathi language. The issues elaborated here depict
the major challenges in Marathi NER.
3.1. Ambiguities in named entity classes
Natural language text is simply a series of characters from the language's character set. It is very
hard to distinguish the characters that represent proper nouns from the characters that represent
normal text. It is also challenging to classify similar sequences of words that represent different
classes in different instances. Ambiguities in names containing words that carry multiple meanings,
or chunks that can be part of different types of NEs, make NER difficult. Word Sense
Disambiguation (WSD) is required to give a system the inference ability to determine whether a
chunk is actually a named entity, or to determine the classification of a named entity. The
following example depicts the overlap of the person and organization entity classes: the words
दीनानाथ मंगेशकर occur in both the person and the organization entity class.
[PER: दीनानाथ मंगेशकर] यांच्या स्मृतीप्रीत्यर्थ [ORG: दीनानाथ मंगेशकर मेमोरियल] मध्ये आयोजित
कार्यक्रमात [PER: लता मंगेशकर] यांनी गीत गायले.
In a program sponsored by the [ORG: Dinanath Mangeshkar Memorial], [PER: Lata Mangeshkar] sang
a song in memory of [PER: Dinanath Mangeshkar].
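The class overlap above means a plain gazetteer lookup cannot decide the label on its own. A minimal sketch (the entries are illustrative, not the paper's resources) of a gazetteer that records all candidate classes per name makes the ambiguity explicit, so that a downstream disambiguation step knows where context is needed:

```python
# Hypothetical gazetteer in which one name belongs to more than one entity
# class, as with "Dinanath Mangeshkar" (a person, and part of an organization
# name). Lookup alone only yields candidates; context must pick one.
NAME_CLASSES = {
    "Dinanath Mangeshkar": {"PER", "ORG"},  # ambiguous entry
    "Lata Mangeshkar": {"PER"},
}

def candidate_classes(name):
    """All entity classes a gazetteer lookup proposes for this name."""
    return NAME_CLASSES.get(name, set())

def is_ambiguous(name):
    """True when the gazetteer cannot decide the class by itself."""
    return len(candidate_classes(name)) > 1
```

A recognizer would route only the ambiguous names through a WSD-style, context-based classifier, exactly the inference ability the section describes.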
3.2. Abbreviations and non local dependencies
A token can be written in different ways, such as an abbreviation or a long form; usually the first
instance is a descriptive long formulation, followed by instances with short forms or aliases. Such
tokens sometimes require the same label assignment or require cross-referencing; the ability of a
system to handle this is referred to as handling non-local dependencies. External knowledge is
required to deal with non-local dependencies, and constructing such knowledge, including name
lists and expansive lexicons, is not easy, since domain lexicons and names are continuously
expanding. Named entities can be composed of single-word or multi-word chunks of text, so a
parsing prediction or name chunking model is required to predict whether consecutive words
belong to the same entity. The following example depicts these issues for Marathi text.
[MISC: आंतरविद्यालय जलतरण स्पर्धेत] [NUM: चार] [Full form of token: जिल्हे] सहभागी झाले.
[Multiword PER: अनिकेत अजय पाटील] याने [MEASURE: प्रथम] पारितोषिक पटकावले. [Alias PER:
अनिकेत] हा सेंट जोसेफ [Short form of word: जि.] जळगाव शाळेचा विद्यार्थी आहे.
[NUM: Four] districts participated in the [MISC: inter-school swimming competition]. [PER: Aniket
Ajay Patil] won the [MEASURE: first] prize. [PER: Aniket] is a student of [ORG: St. Joseph School,
Jalgaon].
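The non-local dependency in this example (the alias "Aniket" inheriting the label of the earlier full mention "Aniket Ajay Patil") can be sketched with a simple alias registry. This is an illustrative assumption, not the paper's method: the heuristic that an alias equals the first token of a previously seen full name is deliberately crude, and real systems use richer coreference features.

```python
# Hypothetical sketch of resolving a later short form to the label of an
# earlier full mention. Matching on the first token of the full name is an
# illustrative heuristic only.
def build_registry(mentions):
    """mentions: list of (full_name, label) pairs seen earlier in the text."""
    registry = {}
    for name, label in mentions:
        registry[name] = label
        registry[name.split()[0]] = label  # first token acts as an alias key
    return registry

def resolve(alias, registry):
    """Return the label for an alias, or None if no earlier mention matches."""
    return registry.get(alias)
```

For the example above, `build_registry([("Aniket Ajay Patil", "PER")])` lets a later occurrence of "Aniket" resolve to PER, which is exactly the cross-referencing the section calls for.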
3.3. Foreign words
Foreign words appear in Marathi texts spelled in the Devanagari script. Table 1 shows
some instances of person, organization, location and miscellaneous names that are English words
spelled in Devanagari. It is very challenging to recognize such words, and it is very difficult to
create gazetteers that include such names because their number is unbounded.
Table 1. Named entities containing foreign words

Person: डेव्हिड धवन (David Dhawan); ए. के. पुरवार (A. K. Purvar); विल्फ्रेड डिसूझा (Wilfred D'souza)
Organization: स्कूल ऑफ आर्ट्स (School of Arts); स्टुडिओ लिंक (Studio Link); एनएफडीसी (NFDC)
Location: यूपी (UP); कॅनडा (Canada); इंग्लंड (England)
Miscellaneous: वन वर्ल्ड फेलोशिप स्कीम (One World Fellowship Scheme); न्युक्लिअस बजेट योजना (Nucleus Budget Yojana); बॉम्बे टू मॉरिशस (Bombay to Mauritius)
3.4. Agglutinative and inflectional nature of Marathi
Marathi is an agglutinative language. Unlike in English, prefixes and suffixes are added to root
words in Marathi to form meaningful contexts, and sometimes adding a suffix or prefix to a root
word also changes its semantics. Word forms such as अबोल and निंबोल are formed when a
prefix is added to the word बोल (to talk). The agglutinative nature of Marathi can be seen in
Table 2, which shows some variations of the word बोल (from the IIT Bombay FIRE-2010 corpus)
after adding a suffix or prefix to it.
Table 2. Variations of word बोल
बोलू बोलता बोलह बोलKयाने बोललात बोलणाLयाला बोलवायला
बोल बोलती बोल न बोलKयाला बोललास बोलKयातील बोल वKयात
बोलक= बोलते बोलून बोलKयास बोलवणे बोलKयातून बोल वलेल
बोलके बोलतो बोलणार बोलतच बोलवत बोलKयामुळे बोल व(या
बोलDया बोललं बोलणारे बोलतांना बोलवावी बोलKयावर बोल(याबMल
बोलणं बोलला बोलणाLया बोलतात बोल वता बोलतानाह अबोलपणानं
बोलणी बोलल बोलKयाचा बोलताना बोल वल बोलपटात बोलKयावNन
बोलणे बोलले बोलKयाची बोलताह बोल वले बोल(यावर बोलता-बोलता
बोलतं बोललो बोलKयाचे बोलबाला बोलवून बोलवKयात अबोल
बोलत बोल(या बोलKयात बोलभांड बोलणाLयांची बोलवायचं .नंबोल
It is very difficult to use gazetteers, dictionaries, similarity measurement and pattern matching
techniques to recognize Marathi names. Dictionaries and gazetteers contain entries without any
suffix added, while in Marathi suffixes are added to words to create meaningful context.
Table 3. Named entity instances and the actual gazetteer record

Named Entity -> Gazetteer Entry
मधुबाला या (Madhubala's) -> मधुबाला
फ्लिंटॉफकरवी (by Flintoff) -> फ्लिंटॉफ
खानदेश एज्युकेशन सोसायटीचं (of Khandesh Education Society) -> खानदेश एज्युकेशन सोसायटी
राष्ट्रवादी काँग्रेसमधील (in Rashtravadi Congress) -> राष्ट्रवादी काँग्रेस
अथेन्समध्ये (in Athens) -> अथेन्स
वाशीहून (from Washi) -> वाशी
Table 3 illustrates the difficulties in using list lookup and pattern matching techniques for
Marathi text. A well-written stemmer is required for a morphologically rich language like Marathi
to separate the root from the suffix, so that word forms can be compared with gazetteer or
dictionary entries. It cannot be claimed that stemming will solve the problem completely, because
adding suffixes to roots may change the grammatical category of the root word, which may result
in wrong entity recognition. Table 4 depicts how the entity class changes after a suffix is added to
the tokens.
Table 4. Named entity class before and after stemming

Entity class before stemming -> Entity class after stemming
गरवारे महाविद्यालयाजवळच (near Garware College) -> गरवारे महाविद्यालय
थाई ऍडव्हिसर क्लबवर (on Thai Advisor Club) -> थाई ऍडव्हिसर क्लब
कोपरी भाग कार्यालयाजवळील (near Kopari Divisional Office) -> कोपरी भाग कार्यालय
पुणे कॅन्टोन्मेंटमधून सरळ (straight from Pune Cantonment) -> पुणे कॅन्टोन्मेंट
दामोदर हॉल यामागे (behind Damodar Hall) -> दामोदर हॉल
3.5. Spelling variations
In Marathi, the four vowels इ (i) (vowel sign ि◌), ई (ī) (vowel sign ◌ी), उ (u) (vowel sign ◌ु) and
ऊ (ū) (vowel sign ◌ू) make no phonetic difference for many speakers but differ in writing and
spelling. Words such as आ_ण (grammatically correct) and आणी (grammatically incorrect), or पाणी
(grammatically correct) and पानी (grammatically incorrect), are used interchangeably in many
writings. Marathi grammar may not be followed completely in free-style text writing, which
affects text recognition systems. There is a lack of uniformity in spelling in Marathi: some
Marathi words, shown in Table 5, can be written in either of two alternate forms that point to the
same thing.
Table 5. Lack of uniformity in writing
Word form1 Word form2
एनिDव टकडून एनDवी टकडुन
भु.तयाखेर ज भुतीयाखे0रज
मुंबई टाइJस मुंबई टाईJस
काि`मर का`मीर
हंगेर या हंगे0र या
3.6. Encoding related issues
Each language uses its own character set in legacy encodings. Characters written using one native
character encoding may not be displayed correctly by another encoding system [6]. Marathi text
is encoded using various fonts and systems. If a document is opened on a computer that does not
support the font or system in which the document was written, the text is displayed with
unreadable characters and becomes unusable. Sometimes a document created on one computer
cannot be fully displayed on a computer with the same configuration but a different operating
system; in such situations some characters are readable while others appear as hollow circles or
are wrongly spelled. Such text is called partly corrupted. Table 6 shows some instances of text
typed on one computer that appeared different on another.
Table 6. Problems with character encodings
Incorrect version Correct version
क3 द क3 a
श्◌ान् `न
वलेपालेर् वलेपाल%
इंHलंडतफे र् इंHलंडतफ%
मं◌ुबई मुंबई
3.7. Dialects
Marathi is spoken in many dialects, such as standard Marathi, Warhadi, Ahirani, Dangi,
Vadvali, Samavedi, Khandeshi and Malwani, in various regions of India. Each dialect has specific
words of its own, and words from different dialects also appear in Marathi text. Some regional
words, the corresponding words in standard Marathi and their meanings in English are illustrated
in Table 7.
Table 7. Variations in dialects in Marathi
Dialect Text Dialect Standard Marathi Meaning in English
Ahirani सैपाक वंयपाक to cook
Warhadi घेउन eया eया take
Warhadi _खल वले खाऊ घातले feed
Warhadi काहून? का? why?
Warhadi लोचा सम या problem
Konkani उबी फू टपgी अनुलंब मापनी vertical ruler
Khandeshi वावरात शेतात in farm
Annotated corpora, name gazetteers, dictionaries, morphological analyzers, POS taggers etc. are
not available for Marathi in the required quantity and quality. Marathi is also a relatively free
word order language.
4. METHODOLOGIES AND TECHNIQUES
A mechanism is needed that can look up, analyze and extract the patterns required to distinguish
noun phrases from normal text. Every word has characteristics that describe and designate it;
such characteristics are called features. Features describing a word can be used for identification
and classification of NEs. The main techniques for named entity recognition are word-level
feature based recognition, list or dictionary lookup and corpus based recognition. Name finding is
based on one of the following approaches: (1) hand-crafted or automatically acquired rules or
finite state patterns, (2) lookup in large name lists or other specialized resources, or (3) data
driven approaches exploiting the statistical properties of the language (statistical models) [7].
NER involves three significant subtasks. First is tokenization, the segmentation of text into
tokens. Second is the assignment of an appropriate tag to each segmented token; this may result
in ambiguous assignments. Third is the selection of the correct tag for a token by resolving the
ambiguity introduced by the second step. Tokenization includes identification of sentence and
word form boundaries, segmenting text into tokens, marking the word boundaries and resolving
ambiguities while segmenting the text. Next to tokenization is lexicon creation. A lexicon entry is
a token with its potential features and properties. Lexicons can be full-form, morphological or
statistical. The most preferable way to generate lexicons for NE tagging is either morphological
analysis or derivation from a large corpus. Rule based NER uses morphological lexicons,
whereas machine learning techniques need statistical probabilities associated with token and tag
pairs.
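The tokenization step described above can be sketched as follows. This is an illustrative
whitespace-and-punctuation tokenizer written for this discussion (the function name and sample
sentence are our own), not part of any published Marathi toolkit.

```python
import re

def tokenize(text):
    """Segment text into sentences, then each sentence into word tokens.

    A minimal sketch: sentences end at '.', '?', '!' or the Devanagari
    danda '।'; a production tokenizer would also handle abbreviations,
    numerals with internal punctuation, and similar cases.
    """
    sentences = re.split(r'(?<=[.?!।])\s+', text.strip())
    return [re.findall(r'[^\s.?!।]+', s) for s in sentences if s]

# Two Marathi sentences become two lists of word tokens.
print(tokenize("राम पुण्यात राहतो. तो उद्या मुंबईला जाईल."))
```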
4.1. Word-level feature based recognition
Word level features describe attributes of individual tokens. These properties can be used to
decide whether a token is an NE and to classify it into the most appropriate NE class. Typical
word level features useful for NER tasks are shown in Fig. 2.
Figure 2. Word Level Features for NER
Lexicons used for rule based NER can be developed using morphological analysis and POS
tagging. The following techniques mainly focus on tokens, describing them with features that are
helpful for the NERC task.
4.1.1. Regular expressions (RE) lookup
RE lookup is mostly useful for segmenting text into tokens and for describing patterns that
match numerical entities such as monetary expressions, dates, email addresses and phone
numbers, e.g. ([०-९]+[,])*[०-९]([.] [०-९]+)?. RE lookup can be applied in Marathi named entity
recognition to identify numerical patterns, date and time expressions etc.
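As a sketch of such patterns, the regular expression below (adapted from the pattern above to
anchor the optional decimal part to the whole number) matches Devanagari-digit numbers with
comma grouping; the sample string is illustrative.

```python
import re

# Devanagari digits ० (U+0966) through ९ (U+096F) form a contiguous code
# point range, so they can be used in a character class just like [0-9].
NUMBER = re.compile(r'[०-९]+(?:,[०-९]+)*(?:\.[०-९]+)?')

text = "किंमत १,२५०.७५ रुपये आहे"
print(NUMBER.findall(text))   # extracts the numeric expression
```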
4.1.2. Morphological analysis
Morphological analysis is the process of decomposing a word into its constituents [8]. It is used
to build NER lexicons that are represented by finite state networks. An NER lexicon consists of a
word form followed by its optional lemmas, morphological features and NE tag. Heuristics are
derived from the morphology and semantics of the input. Morphological analysis can give
information about word form inflections that is beneficial for NE tagging [9]. Named entity
recognition techniques that focus on tokens and their associated features are illustrated in Fig. 3.
[Figure labels: Word Features — Part of Speech (POS), POS-LEFT, POS-RIGHT, Contextual words, Morphological Prefixes, Morphological Suffixes, Short Type]
Figure 3. Word level Named Entity Recognition
4.1.3. POS tagging
The part of speech of a word (Table 8) plays a vital role in NER [10]. POS patterns highlight
word forms such as proper names, verbs and nouns that can be named entities. The POS of a
token can mark it as a noun in Marathi text; nouns can further be classified as common or proper
nouns. All extracted proper nouns become candidates for classification into an appropriate NE
class such as person, location, organization etc.
Table 8. Marathi Word with its POS tag and NE class
Marathi Word POS tag NE class
jुती Proper Noun Person
जळगाव Proper Noun Location
नद Common Noun No entity
पु तक Common Noun No entity
.नरमा Proper Noun Organization
4.1.4. Shallow parsing
The position of a word phrase in a sentence helps in understanding its role and meaning in the
construction. Shallow parsing at the syntactic level can analyze the arrangement of words in a
construction. Proper names often have some implicit structure that can be used as internal
evidence to identify their entity type. Names are also often surrounded by predictive words that
can be used as external evidence in NER. Marathi has many such predictive words surrounding
NEs that can act as external evidence for shallow parsing. For instance, the word appearing
before ‘यांनी’ (yanee) or ‘यांचे’ (yanche) will almost certainly be a person name. Similarly, the word
appearing before ‘येथे’ (yethe) or ‘इथून’ (ithoon) will be a location, and the token appearing before
‘रोजी’ (roji) will be a date expression. Person titles such as ‘ङॉ.’ (Dr.), ‘jी.’ (shri), ‘सौ.’ (sau), ‘ ा.’
(pra) or designating words such as ‘ धानमंoी’ (pradhanmantree), ‘राRयपाल’ (Rajyapal), ‘गृहमंoी’
(gruhmantri) can also be used to recognize people in Marathi text. The expression <ता.> + {२
माच} + <रोजी> can be used to recognize date expressions in the text.
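The trigger-word rules above can be sketched as a simple lookup on the following token. The
dictionary below covers only the trigger words mentioned in this section, and the function name
is our own.

```python
# External-evidence rules: the token before 'यांनी'/'यांचे' is treated as a
# person, the one before 'येथे'/'इथून' as a location, and the one before
# 'रोजी' as a date expression.
TRIGGERS = {
    'यांनी': 'PERSON', 'यांचे': 'PERSON',
    'येथे': 'LOCATION', 'इथून': 'LOCATION',
    'रोजी': 'DATE',
}

def tag_by_triggers(tokens):
    """Assign a coarse NE class to each token from the word that follows it."""
    tags = []
    for i, tok in enumerate(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        tags.append((tok, TRIGGERS.get(following, 'O')))
    return tags

print(tag_by_triggers(['सावरकर', 'यांनी', 'सांगितले']))
```

A real system would combine such external evidence with internal evidence (titles, designations)
rather than rely on a single following word.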
4.2. List lookup based recognition
Lists of named entities, also called gazetteers, dictionaries or lexicons, are comprehensive lists
that include names of organizations, people, continents, countries, capitals of countries, states,
cities, geographical regions, governments, celebrities, airlines, events, month names, days of the
week [2] etc. Word phrases are compared with entities in various gazetteers; for instance, if a
word phrase is found in the list of countries then the probability of that word being of entity type
location becomes higher. The lookup can be done using exact full string matching, filtering
operations such as stemming or lemmatization, phonetic matching techniques such as Soundex or
Editex, or approximate matching techniques that calculate the distance between the strings to be
matched [11]. The advantages of this technique are that it is simple, fast and gives accurate
results; the downside is that generating and expanding the lists is expensive, and lists can deal
neither with variance nor with ambiguity [12]. Words in sentences of natural language text can
have variations, inflections or derivations, so techniques that resolve such variations are needed
before string matching can be done. Some useful techniques for matching words in test text
against words in various lists, gazetteers or dictionaries are shown in Fig. 4.
Figure 4. List lookup Techniques for Named Entity Recognition
4.2.1. Stemming
Stemming is the technique of removing affixes from words. The part of the word that is
common to all its variants is called the stem. Candidate words are trimmed of inflectional and
derivational suffixes as shown in Table 9; the stemmed words are then matched against the lists
to find the correct entity type.
Table 9. Stemming of Marathi Words
Inflected Word Form Stemmed Word Form
सावरकरां या सावरकर
सेहवागसह सेहवाग
काँVेसने काँVेस
राSTवाद तील राSTवाद
द(ल त द(ल
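A longest-match suffix stripper in the spirit of Table 9 can be sketched as below. The suffix
inventory is a small illustrative sample chosen for this sketch, not a complete list of Marathi
inflections.

```python
# Illustrative suffixes only; a real Marathi stemmer needs a much larger,
# linguistically validated inventory. Longest suffixes are tried first.
SUFFIXES = sorted(['ने', 'ला', 'चा', 'ची', 'चे', 'सह', 'हून', 'तील', 'मध्ये'],
                  key=len, reverse=True)

def stem(word, min_stem=2):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

print(stem('काँग्रेसने'))   # suffix ने stripped
print(stem('सेहवागसह'))    # suffix सह stripped
```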
4.2.2. Lemmatization
Lemmatization is the technique of finding a match between a word and an entry in the language
vocabulary or knowledge base. If the word is only partially matched, a list of possible lemmas is
generated. Lemmatization covers basic word variations such as singular vs. plural as well as
thesaurus or synonym matching, and it normalizes inflections. Table 10 shows some Marathi
instances which show the need for lemmatization.
Table 10. Lemmatization
Singular Plural Word Synonym
खुच6 खु या वषा पाऊस
बॅग बॅगा सुय रवी
बाटल बाट(या चंa शशी
पाकळी पाकqया राo .नशा
फु लपाखr फु लपाखरे आनंद हष
4.2.3. Canonical normalization
This technique can use a Unicode normalizer. Diacritics are additional glyphs added to letters,
mainly to change the sound value of a letter in a word; for string matching it is useful to ignore
them. Canonical normalization identifies characters that have the same meaning but a different
appearance and replaces them with a matching character. It is useful in list matching to extract
the correct entity type of a word. e.g. वता -> वत:
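In Unicode terms this is what normalization forms such as NFC provide. The sketch below uses
Python's `unicodedata` module; the nukta consonant chosen as the example is our own
illustration.

```python
import unicodedata

def canonicalize(text):
    """Replace canonically equivalent character sequences with one form."""
    return unicodedata.normalize('NFC', text)

# The precomposed letter क़ (U+0958) and the sequence क + combining nukta
# (U+0915 U+093C) look identical but compare unequal until both strings
# are normalized to the same canonical form.
decomposed = '\u0915\u093c'
precomposed = '\u0958'
print(decomposed == precomposed)                              # False
print(canonicalize(decomposed) == canonicalize(precomposed))  # True
```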
4.2.4. Threshold edit-distance
This technique uses search and extension based algorithms. Edit distance is used to deal with
typographic and spelling mistakes and to determine the syntactic closeness of two strings. The
distance between two strings is defined as the minimal number of edits required to convert one
into the other, using the character level operations insertion, deletion, replacement and
transposition. To find the closest match in a large dictionary, entries are organized in a trie and
partitioned by length, assuming a small edit distance called the threshold. e.g. क3 द -> क3 a.
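The minimal-edit definition above (without transposition, for brevity) can be sketched as the
classic dynamic-programming recurrence; the threshold filter at the end is a simple linear scan
rather than the trie organization described above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimal insertions, deletions and replacements
    needed to turn string a into string b (transposition omitted here)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion of ca
                current[j - 1] + 1,            # insertion of cb
                previous[j - 1] + (ca != cb),  # replacement (or match)
            ))
        previous = current
    return previous[-1]

def within_threshold(word, entries, k=1):
    """Return gazetteer entries no more than k edits away from word."""
    return [e for e in entries if edit_distance(word, e) <= k]

print(edit_distance('पानी', 'पाणी'))   # one replacement
```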
4.2.5. Jaccard distance
Jaccard distance [13] measures the dissimilarity between sample sets. It returns a distance
between two items that measures how close or similar the items are overall. The strings to be
compared are first tokenized; the Jaccard similarity is then computed by dividing the number of
tokens shared by the strings by the total number of tokens, and the distance is its complement.
Comparisons can be made at the character, sentence or n-gram level. e.g. पावसाqयात नद काठu
असलेल हरवळ मनमोहक असते. -> मनमोहक असलेल हरवळ पावसाqयात नद काठu असते.
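On whitespace tokens the measure can be sketched as below, with the distance taken as the
complement of the shared-token ratio; the English sentence pair mirrors the reordering example
above and is our own illustration.

```python
def jaccard_distance(s1, s2):
    """1 - |shared tokens| / |union of tokens| over whitespace tokens."""
    t1, t2 = set(s1.split()), set(s2.split())
    if not (t1 | t2):
        return 0.0
    return 1.0 - len(t1 & t2) / len(t1 | t2)

# Reordering the words does not change the token set, so the distance is 0.
print(jaccard_distance('the greenery near the river is lovely',
                       'lovely is the greenery near the river'))
```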
4.2.6. Jaro-Winkler distance
The distance between strings is based on vector similarity, using the cosine measure and
weighted term frequencies. Two strings are more similar if they contain many of the same tokens
with the same relative number of occurrences of each; tokens are weighted more heavily if they
occur in few documents [13].
4.2.7. Soundex
Soundex converts a word into a code for its uttered sound. The basic idea is that similarly
spelled or pronounced words are encoded to the same code [14]. e.g. आणी -> आ_ण.
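A sketch of the classic (English-alphabet) Soundex coding is shown below; Marathi would need
an analogous phonetic grouping defined over Devanagari, which this sketch does not attempt.

```python
def soundex(word):
    """Classic four-character Soundex code for an ASCII word."""
    groups = {**dict.fromkeys('BFPV', '1'), **dict.fromkeys('CGJKQSXZ', '2'),
              **dict.fromkeys('DT', '3'), 'L': '4',
              **dict.fromkeys('MN', '5'), 'R': '6'}
    word = word.upper()
    code, prev = word[0], groups.get(word[0], '')
    for ch in word[1:]:
        digit = groups.get(ch, '')
        if digit and digit != prev:
            code += digit
        if ch not in 'HW':   # H and W are transparent to the previous code
            prev = digit
    return (code + '000')[:4]

# Similarly pronounced names receive the same code.
print(soundex('Robert'), soundex('Rupert'))
```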
4.2.8. Editex
Editex is an enhancement of the Levenshtein edit distance technique. It groups similar phonemes
into equivalence classes, straightforwardly combining Levenshtein distance with ‘letter groups’
(aeiouy, bp, ckq, etc.) such that letters in the same group frequently correspond to similar
phonemes [15]. e.g. मुंबई टाइJस -> मुंबई टाईJस
4.3. Machine learning using corpus based recognition
Word level and list lookup based recognition give accurate, high performance results but need
exhaustive linguistic resources and grammar expertise, and systems developed with these
approaches are not portable to other domains. Therefore many researchers prefer corpus based
recognition, also known as stochastic or statistical machine learning. Fig. 5 explores the data
driven (corpus based) approach to named entity recognition, which uses supervised,
semi-supervised or unsupervised learning methods. The various statistical models used to design
named entity recognition and classification systems are explained as follows.
Figure 5. Machine learning approaches for Named Entity Recognition
4.3.1. Hidden Markov Models (HMM)
HMM is a statistical language model that computes the likelihood of a sequence of words using
a Markov chain, in which the likelihood of the next word depends on the current word. In this
model NE classes are represented by hidden states and words by the observations they emit, and
a probability is associated with every transition from the current state to the next. The model can
predict the NE class of the next word when the current word and its NE class are given. The
Viterbi algorithm is used to find the sequence of NE classes with the highest probability. In the
sentence jी. रतन टाटा हे भारतातील एक य़श वी 8य़Dतीम व होय., if P([रतन]|[jी.], PER-NAME) >
P([रतन]|[jी.], ORG-NAME), the word रतन is tagged as a person NE, since its probability of being
a person name is greater than that of being an organization name. HMM is a supervised learning
algorithm. Developing an NER system with an HMM needs a tokenizer, a large named entity
tagged training corpus, n-gram language models, an implementation of the Viterbi algorithm for
tagging and an implementation of the Baum-Welch algorithm to improve the HMM parameters
on unmarked texts.
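The Viterbi decoding step can be sketched as follows. The toy model (states, probabilities and
the two-token input) is invented purely for illustration, not estimated from any corpus.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable state (NE class) sequence for the tokens.

    start_p, trans_p and emit_p are probability dicts that would normally be
    estimated from a tagged corpus; unseen events are smoothed with a small
    floor probability, and log-space sums avoid numerical underflow."""
    def lg(p):
        return math.log(p if p > 0 else floor)

    table = [{s: (lg(start_p.get(s, 0)) + lg(emit_p.get(s, {}).get(tokens[0], 0)),
                  [s]) for s in states}]
    for tok in tokens[1:]:
        row = {}
        for s in states:
            score, path = max(
                (table[-1][p][0] + lg(trans_p.get(p, {}).get(s, 0))
                 + lg(emit_p.get(s, {}).get(tok, 0)), table[-1][p][1])
                for p in states)
            row[s] = (score, path + [s])
        table.append(row)
    return max(table[-1].values())[1]

# Toy two-class model: a person-like emission makes 'रतन' a person name.
states = ['PER', 'O']
start_p = {'PER': 0.3, 'O': 0.7}
trans_p = {'PER': {'PER': 0.2, 'O': 0.8}, 'O': {'PER': 0.3, 'O': 0.7}}
emit_p = {'PER': {'रतन': 0.6, 'टाटा': 0.4}, 'O': {'हे': 0.5, 'होय': 0.5}}
print(viterbi(['रतन', 'हे'], states, start_p, trans_p, emit_p))
```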
4.3.2. Maximum Entropy Models (MEMM)
The Maximum Entropy model computes the probability distribution with maximum entropy
that satisfies the constraints set by the training examples. Entropy is a measure of the uncertainty
and randomness of an event [16]. An NER system implementation using Maximum Entropy
needs a tagged training corpus, a tokenizer, a language parser providing at least a minimal
amount of parsing, a morphological analyzer and an ME model for tagging unknown words.
4.3.3. Conditional Random Fields (CRF)
CRF is a relational learning model. NER using CRF is based on an undirected graphical model
of conditionally trained probabilistic finite state automata. CRF calculates the conditional
probability of values on designated output nodes given values on designated input nodes. It
incorporates dependent features and context dependent learning, and it allows representing
dependencies on previous classifications in a discourse. The basic idea is that the context
surrounding a name becomes good evidence for tagging another occurrence of the same name in
a different, ambiguous context. A CRF implementation of NER needs a feature vector consisting
of language features and POS tags, a morphological analyzer, gazetteers and an NE annotated
corpus.
4.3.4. Support Vector Machines (SVM)
SVM is a binary classification technique used to classify named entities. Each sample is
represented by a long binary vector, and the SVM is trained on a large training corpus. Each
training sample is plotted as a point in a hyperspace, and the SVM attempts to draw a hyperplane
between the two categories, splitting the points as evenly as possible. This supervised learning
model needs annotated text, for instance 500,000 words tagged with POS and IOB tags, a
morphological analyzer and an SVM implementation.
4.3.5. Adaboost
In this method recognition is done using binary classifiers to label the words. The context of the
current word is used, and neighboring words are encoded as features along with their relative
positions. NE classification is then done using the multiclass multilabel AdaBoost algorithm
[22].
4.3.6. Decision Trees (DT)
A supervised learning system using decision trees uses information derived from the previous,
current and next words, asking a sequence of questions about the history to determine the
possible output of the model, which is the NE tag for a word. The decision tree model considers
a language model that attempts to determine the next word in the text given a history consisting
of the two previous words. The idea is to build a tree by asking, at every point, the question that
most reduces uncertainty about the set of features. An NER system implementation using
decision trees needs segmentation of text into tokens, a POS tag for each token, prefix and suffix
word dictionaries, and a list of words that are candidate NEs.
4.3.7. Bootstrapping
Bootstrapping refers to the process by which an initial kernel of language is acquired by explicit
instruction [17]. A set of seeds is used to start the process: the system searches for sentences that
contain these seeds and tries to identify contextual clues, then tries to find other instances that
occur in similar contexts.
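The loop can be sketched on toy data as follows; the seed, corpus and the one-word right-context
definition are all illustrative simplifications of real bootstrapping systems.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Grow a name list from seeds by harvesting one-word right contexts.

    Each round: (1) record the word following every known name as a
    contextual clue; (2) label any token followed by a known clue as a name."""
    names, clues = set(seeds), set()
    for _ in range(rounds):
        for sent in sentences:
            for i, tok in enumerate(sent[:-1]):
                if tok in names:
                    clues.add(sent[i + 1])
        for sent in sentences:
            for i, tok in enumerate(sent[:-1]):
                if sent[i + 1] in clues:
                    names.add(tok)
    return names

corpus = [['राम', 'यांनी', 'सांगितले'], ['शाम', 'यांनी', 'लिहिले']]
print(bootstrap(corpus, {'राम'}))   # 'शाम' is learned from the shared context
```

Real systems score candidate contexts and names to keep the process from drifting; this sketch
accepts every harvested clue unconditionally.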
4.3.8. Clustering
Clustering is an unsupervised learning technique useful for the NERC problem. The automatic
generation of groups in data, based on numerical distance or similarity between objects, is called
clustering. Important factors in clustering are feature selection, the similarity function and a set
of constraints such as the shape of the clusters, cluster membership and the number of clusters.
Table 11. Statistical techniques, features and reported F1 measure for NERC
Technique | Features used | F1 Score
HMM [18] | Dictionaries, hand-crafted rules for numeric entities | 90.93
MEMM [19] | Lists derived from training data, local features, global features, name lists | 83.31
CRF [20] | Word level features, probability features, suffix features, bigram & trigram features | 93.65
SVM [21] | Contextual window of size 6, prefix and suffix up to length 3, POS information, gazetteers, lists derived from training data | 91.80
AdaBoost [22] | Lexical, syntactic, orthographic, affix, word type pattern, left prediction, bag-of-words, trigger word and gazetteer features | 85.00
Decision Trees [23] | Syntactic properties, POS information, gazetteers | 94.28
Bootstrapping [17] | Labeled data, a seed classifier, gazetteers | 81.62
Clustering [24] | Permanency feature (ratio of the frequency of a word in the corpus to the frequency of all case-insensitive occurrences of the word), standard deviation of probability, and the length of the word in the named entity | 66.21
Important factors in the unsupervised clustering approach that affect named entity recognition
are ambiguity of terms, the conditional probability of the context for a specified semantic class,
and the syntactic construction of the terms and context. This technique needs categorization of
proper names (person, location and organization names), assimilation of a seed set and their
feature vectors, and classification of unknown words based on context shared with, or similar to,
the seeds. Similarity measurements are based on shared context words, a similar hypernym in
WordNet, or congruence of the syntactic function of the name in the sentence [25]. Table 11
summarizes the various techniques used for implementing named entity recognition.
5. CONCLUSIONS
Named entity recognition is a very important task in Information Extraction systems. It gives
structure to raw unstructured information written in natural language, and NLP tool development
is necessary to enrich Marathi language technology. NER is difficult for Indian languages, and
implementing a Marathi NER system is even more challenging because of issues such as the
inherently agglutinative and inflectional nature of Marathi, ambiguities in named entity classes,
non-local dependencies, the appearance of foreign words and spelling variations. This paper has
explored various methodologies and techniques that may be used in designing a Marathi named
entity recognition system.
ACKNOWLEDGEMENTS
This research work is supported by grants under the Vice Chancellor Research Motivation
Scheme (VCRMS) of North Maharashtra University, Jalgaon and SAP (DRS-I), UGC New
Delhi, India.
REFERENCES
[1] “Text Analytics–Named Entity Extraction”, Whitepaper, Lexalytics, (2012) Retrieved from
http://www.angoss.com/wp-content/uploads/2012/10/Text-Analytics-Named-Entity-Extraction-
White-Paper.pdf
[2] David Nadeau and Satoshi Sekine, (2007) “Survey of Named Entity Recognition and Classification”,
Journal of Linguisticae Investigationes, Vol. 30, No. 1, pp 1-20.
[3] Shaalan, K., (2014) “A Survey of Arabic Named Entity Recognition and Classification”,
Computational Linguistics, Vol. 40, No. 2, pp 469-510, MIT Press.
[4] B. Sasidhar, P. M. Yohan, A. Vinaya Babu, A Goverdhan, (2011) “A Survey on Named Entity
Recognition in Indian Languages with particular reference to Telugu”, International Journal of
Computer Science Issues, Vol. 8, No. 2, pp 438-443.
[5] Padmaja Sharma, Utpal Sharma and Jugal Kalita,(2011) “Named Entity Recognition: A Survey for
the Indian Languages”, Journal of Language in India, Special Volume: Problems of Parsing in Indian
Languages, pp 35-40.
[6] McEnery, A. M., & Xiao, R. Z. (2005). Character encoding in corpus construction. In: Wynne, M.
(ed.): Developing Linguistic Corpora: A Guide to Good Practice. Oxford: AHDS, 47-58.
[7] P. Srikanth, K. N. Murthy, (2008) “Named Entity Recognition for Telugu”, IJCNLP-08 Workshop on
NER for South and South East Asian Languages, Hyderabad, India, pp 41-50.
[8] Kemal Oflazer, (1999) “Morphological Analysis”, Syntactic Word class Tagging Text, Speech and
Language Technology, Vol.9, pp 175-205.
[9] Smruthi Mukund, Rohini Shrihari and Erik Peterson, (2010) “An Information- Extraction System for
Urdu- A Resource Poor Language”, ACM Transactions on Asian Language Information Processing,
Vol. 9, No. 4, pp. 1-43
[10] H B Patil, A S Patil and B V Pawar (2014) “Part-of-Speech Tagger for Marathi Language using
Limited Training Corpora”, IJCA Proceedings on National Conference on Recent Advances in
Information Technology NCRAIT(4), pp. 33-37.
[11] Artem Boldyrev, (2013) “Dictionary-Based Named Entity Recognition”, Master’s Thesis in Computer
Science, Universität des Saarlandes.
[12] Shaalan, K., and H. Raza, (2007) “Person Name Entity Recognition for Arabic”, ACL 2007
Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources,
Prague, Czech Republic, Association for Computational Linguistics, pp. 17–24
[13] William W. Cohen , Sunita Sarawagi, (2004) “Exploiting Dictionaries in Named Entity Extraction:
Combining Semi-Markov Extraction Processes and Data Integration Methods”, International
Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, pp 89-98.
[14] Hema Raghavan, James Allan, (2004) “Using Soundex Codes for Indexing Names in ASR
Documents”, Workshop on Interdisciplinary Approaches to Speech Index Retrieval, at HLT-NAACL,
Boston, Massachusetts, pp 22-27.
[15] Zobel, J., Dart, P.,( 1996) “Phonetic String Matching: Lessons from Information Retrieval”, ACM
SIGIR , ACM, New York, pp 166-172.
[16] Chieu, H. L., Ng Hwee Tou, (2002) “Named Entity Recognition: A Maximum Entropy Approach
using Global Information”, 19th International Conference on Computational Linguistics (COLING
2002), San Francisco, pp 190-196.
[17] Zornitsa Kozareva, (2006) “Bootstrapping Named Entity Recognition with Automatically Generated
Gazetteer Lists”, 11th Conference of the European Chapter of the Association for Computational
Linguistics: Student Research Workshop (EACL '06), Stroudsburg, PA, USA, pp 15-21.
[18] Todorovic, B.T., Rancic, S.R., Markovic, I.M., Mulalic, E.H., Ilic, V.M.,(2008) "Named Entity
Recognition and Classification using Context Hidden Markov Model," Neural Network Applications
in Electrical Engineering, NEUREL 2008, pp 43-46.
[19] Hai Leong Chieu, Hwee Tou Ng, (2003) “Named Entity Recognition with a Maximum Entropy
Approach”, 7th Conference on Natural Language Learning, Association for Computational
Linguistics, Stroudsburg, PA, USA, Vol. 4, pp 160-163.
[20] Suxiang Zhang, Suxian Zhang, Xiaojie Wang, (2007) “Automatic Recognition of Chinese
Organization Name Based on Conditional Random Fields”, International Conference on Natural
Language Processing and Knowledge Engineering (NLP-KE 2007), Beijing, China, pp 229-233.
[21] Asif Ekbal, Sivaji Bandyopadhyay,(2008) “Bengali Named Entity Recognition using Support Vector
Machine” ,Workshop on NER for South and South East Asian Languages (IJCNLP-08), Hyderabad,
India, pp 51–58.
[22] Xavier Carreras, Lluís Màrquez, Lluís Padró, (2003) “A Simple Named Entity Extractor using
AdaBoost”, 7th Conference on Natural Language Learning, Association for Computational
Linguistics, Stroudsburg, PA, USA, Vol. 4, pp 152-155.
[23] Georgios Paliouras, Vangelis Karkaletsis, Georgios Petasis, and Constantine D. Spyropoulos, (2000)
“Learning Decision Trees for Named-entity Recognition and Classification”, ECAI Workshop on
Machine Learning for Information Extraction.
[24] Da Silva, Joaquim Ferreira, Kozareva, Z., Lopes, G. P., (2004) “Cluster Analysis and Classification of
Named Entities”, 4th International Conference on Language Resources and Evaluation (LREC 2004),
Lisbon, Portugal, pp 321-324
[25] Agichtein, Eugene, Luis Gravano, (2003) “Querying Text Databases for Efficient Information
Retrieval”, 19th International Conference on Data Engineering, IEEE Computer Society, Columbia
Univ., USA, pp 113-124.
[26] Steven Abney, (2007) “Semisupervised Learning for Computational Linguistics”, Chapman &
Hall/CRC.