Based on the sense definitions of words available in the Bengali WordNet, an attempt is made to classify Bengali sentences automatically into different groups in accordance with their underlying senses. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while information about the different senses of a particular ambiguous lexical item is collected from the Bengali WordNet. On an experimental basis, we have used the Naive Bayes probabilistic model as a classifier of sentences. We have applied the algorithm over 1747 sentences that contain a particular Bengali lexical item which, because of its ambiguous nature, is able to trigger different senses and thus give the sentences different meanings. In our experiment we have achieved around 84% accuracy in sense classification over the total input sentences. We have analyzed the residual sentences that did not comply with our experiment and affected the results, noting that in many cases wrong syntactic structure and scant semantic information are the main hurdles in the semantic classification of sentences. The applicational relevance of this study is attested in automatic text classification, machine learning, information extraction, and word sense disambiguation.
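As an illustration of the kind of classifier this abstract describes, here is a minimal sketch of a Naive Bayes sense classifier over sentences using scikit-learn. The sentences, sense labels, and bigram features are illustrative assumptions, not the paper's TDIL data or Bengali WordNet senses.

```python
# Minimal sketch of a Naive Bayes sense classifier for sentences
# containing one ambiguous lexical item. The sentences and sense
# labels below are illustrative placeholders, not the TDIL corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each training sentence contains the ambiguous word; the label is
# the sense it realizes in that context.
train_sentences = [
    "the river bank was flooded after the rain",
    "she deposited the cheque at the bank",
]
train_senses = ["bank.n.river", "bank.n.finance"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_sentences, train_senses)

print(clf.predict(["he withdrew cash from the bank"]))
# e.g. ['bank.n.finance'] on this toy data: the bigram "the bank"
# appears only in the finance training sentence.
```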
A Hybrid Approach to Word Sense Disambiguation With and With...ijnlc
Word Sense Disambiguation is the classification of a word's meaning in a precise context, a tricky task in Natural Language Processing that underlies applications such as machine translation, information extraction and retrieval, and open- or closed-domain question answering, because of the semantic understanding it requires. Researchers have tried unsupervised and knowledge-based learning approaches, but such approaches have not proved very helpful. Various supervised learning algorithms have also been developed, with limited success, since creating the training corpus, a sense-tagged corpus, is difficult. This paper presents a hybrid approach for resolving ambiguity in a sentence based on integrating lexical knowledge and world knowledge. The English WordNet developed at Princeton University, the SemCor corpus, and the JAWS library (Java API for WordNet Searching) have been used for this purpose.
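The paper accesses Princeton WordNet through JAWS in Java; as a rough illustration of the same lexical-knowledge lookup, here is an analogous sense enumeration using NLTK's WordNet interface in Python (the choice of NLTK is an assumption, not the paper's toolchain).

```python
# Enumerate the candidate senses of an ambiguous word from WordNet.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# A hybrid system combines these lexical definitions with world
# knowledge, e.g. sense frequencies observed in the SemCor corpus,
# to choose among the candidates.
```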
A SURVEY OF GRAMMAR CHECKERS FOR NATURAL LANGUAGEScsandit
ABSTRACT
Natural Language Processing is an interdisciplinary branch of linguistics and computer science, studied under Artificial Intelligence (AI), that gave birth to an allied area called ‘Computational Linguistics’, which focuses on processing natural languages on computational devices. A natural language consists of a large number of sentences, which are linguistic units involving one or more words linked together in accordance with a set of predefined rules called a grammar. Grammar checking is the task of validating sentences syntactically and is a prominent tool within language engineering. Our review draws on the recent development of various grammar checkers to look at the past, present and future in a new light. It covers grammar checkers of many languages with the aim of examining their approaches and methodologies for developing new tools and systems as a whole. The survey concludes with a discussion of the various features included in existing grammar checkers of foreign languages as well as a few Indian languages.
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
In recent decades, speech interactive systems have gained increasing importance. The performance of an ASR system mainly depends on the availability of a large speech corpus. The conventional method of building a large-vocabulary speech recognizer for any language uses a top-down approach to speech. This approach requires a large speech corpus with sentence- or phoneme-level transcriptions of the speech utterances. The transcriptions must also cover different sound sequences so that the recognizer can build models for all the sounds present. But for Telugu, because of its complex nature, a very large, well-annotated speech database is very difficult to build. It is very difficult, if not impossible, to cover all the words of any Indian language, where each word may have thousands or even millions of word forms. A significant part of the grammar that is handled by syntax in English (and other similar languages) is handled within morphology in Telugu. Phrases comprising several words (that is, tokens) in English would be mapped onto a single word in Telugu. Telugu is phonetic in nature as well as rich in morphology. That is why the speech technology developed for English cannot be applied directly to Telugu. This paper highlights the work carried out in an attempt to build a voice-enabled text editor with the capability of automatic term suggestion. The main claim of the paper is the recognition enhancement process we developed to suit highly inflecting, morphologically rich languages. This method results in increased speech recognition accuracy with a substantial reduction in corpus size. It also adds Telugu words to the database dynamically, resulting in growth of the corpus.
International Journal of Engineering and Science Invention (IJESI) inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of Engineering, Science and Technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is the task of removing ambiguity from a word, i.e. identifying the correct sense of a word in an ambiguous sentence. This paper describes a model that uses a Part of Speech tagger and three categories for word sense disambiguation (WSD). Improving the interaction between users and computers is an important goal of Human Computer Interaction; to this end, supervised and unsupervised methods are combined. The WSD algorithm is used to find the correct sense of a word efficiently and accurately based on domain information. The accuracy of this work is evaluated with the aim of finding the most suitable domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word sense disambiguation
Suitability of naïve bayesian methods for paragraph level text classification...ijaia
The amount of data present online is growing very rapidly, so organizing and categorizing data has become an obvious need. Information Retrieval (IR) techniques assist users in obtaining relevant information. IR in the Indian context is very relevant, as several blogs and news publications in Indian languages are present online. This work looks at the suitability of Naïve Bayesian methods for paragraph-level text classification in the Kannada language. The Naïve Bayesian methods are among the most basic algorithms for text categorization tasks. We apply dimensionality reduction using minimum term frequency along with stop-word identification and elimination to achieve the task. It is evident that the Naïve Bayesian Multinomial model outperforms the simple Naïve Bayesian approach in paragraph classification tasks.
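A minimal sketch of the comparison this abstract reports, assuming the "simple" Naïve Bayesian approach is read as a Bernoulli (term presence/absence) model while the multinomial model uses term counts; the paragraphs, labels, and stop-word handling are placeholders, not the Kannada corpus.

```python
# Compare Bernoulli NB (term presence/absence) with multinomial NB
# (term counts) on toy paragraph data; real input would be Kannada text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

paragraphs = ["placeholder paragraph about cricket scores and matches",
              "placeholder paragraph about elections and parliament"]
labels = ["sports", "politics"]

# min_df implements the minimum-term-frequency reduction; stop_words
# would hold a Kannada stop-word list in the paper's setting.
vec = CountVectorizer(min_df=1, stop_words=None)
X = vec.fit_transform(paragraphs)

for model in (BernoulliNB(), MultinomialNB()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```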
Named Entity Recognition for Telugu Using Conditional Random FieldWaqas Tariq
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names, and miscellaneous (date and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports on the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of different contextual information about the words, along with a variety of features that are helpful in predicting the four named entity (NE) classes: person name, location name, organization name, and miscellaneous (date and others). Keywords: Named entity, Conditional Random field, NE, CRF, NER, named entity recognition
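A minimal sketch of a CRF token tagger in the spirit of this system, using the third-party sklearn-crfsuite package (an assumption; the authors' CRF toolkit is not named). The sentence, features, and tags are toy placeholders.

```python
# CRF-based NER sketch: each token is described by contextual features.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "word": word,
        "prefix3": word[:3],   # affixes carry signal in inflecting languages
        "suffix3": word[-3:],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

# Toy Telugu-transliterated sentence with placeholder NE tags.
sentence = ["ramudu", "hyderabad", "lo", "unnadu"]
tags = ["B-PER", "B-LOC", "O", "O"]

X = [[token_features(sentence, i) for i in range(len(sentence))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```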
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
Natural Language Processing (NLP) techniques are among the most widely used techniques in the field of computer applications, and the area has become vast and advanced. Language is the means of communication among humans, and in the present scenario, when everything depends on machines and everything is computerized, communication between computers and humans has become a necessity. To fulfill this necessity, NLP has emerged as the means of interaction that narrows the gap between machines (computers) and humans. It evolved from the study of linguistics and was passed through the Turing test to check similarity between data, but was limited to small data sets. Later on, various algorithms were developed, along with the concept of AI (Artificial Intelligence), for the successful execution of NLP. In this paper, the main emphasis is on the different techniques of NLP that have been developed so far, their applications, and a comparison of all those techniques on different parameters.
The Natural Language Toolkit (NLTK) is a generic platform for processing data from various natural (human) languages, and it also provides resources for Indian languages such as Hindi, Bangla and Marathi. In the proposed work, the repositories provided by NLTK are used to carry out the processing of Hindi text and subsequently the analysis of Multi-word Expressions (MWEs). MWEs are lexical items that can be decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical idiomaticity. The main focus of this paper is on the processing and analysis of MWEs for Hindi text. The corpus used for Hindi text processing is taken from the famous Hindi novel “KaramaBhumi” by Munshi PremChand. The result analysis is done using the Hindi corpus provided by the Resource Centre for Indian Language Technology Solutions (CFILT). Results are analysed to justify the accuracy of the proposed work.
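A small sketch of statistical MWE candidate extraction with NLTK, in the spirit of the work above. NLTK ships a small Hindi sample in its 'indian' corpus, which stands in here for the novel text the authors process.

```python
# Statistical MWE extraction for Hindi using NLTK collocations.
# Requires: nltk.download('indian')
from nltk.corpus import indian
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = indian.words("hindi.pos")            # tokenized Hindi sample text
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)                  # keep bigrams seen >= 3 times

# High-PMI bigrams are MWE candidates showing statistical idiomaticity.
measures = BigramAssocMeasures()
for pair in finder.nbest(measures.pmi, 10):
    print(pair)
```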
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
Many Natural Language Processing (NLP) applications involve Named Entity Recognition (NER) as an important task, where it improves the overall performance of NLP applications. In this paper, deep learning techniques are used to perform the NER task on Hindi text data, since, compared to English NER, Hindi NER has not been sufficiently explored. This is a barrier for resource-scarce languages, as many resources are not readily available. Many researchers use various techniques such as rule-based, machine learning based and hybrid approaches to solve this problem. Deep learning based algorithms are now being developed at large scale as an innovative approach for advanced NER models that give the best results. In this paper we devise a novel architecture based on a residual network architecture over a Bidirectional Long Short Term Memory (BiLSTM) with fastText word embedding layers. For this purpose we use pre-trained word embeddings to represent the words in the corpus, where the NER tags of the words are defined in the annotated corpora used. Development of an NER system for Indian languages is a comparatively difficult task. In this paper, we have run various experiments to compare the results of NER with normal embedding and fastText embedding layers, and to analyse the performance of the word embeddings with different batch sizes when training the deep learning models. We present state-of-the-art results with the said approach in terms of F1 score.
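A minimal Keras sketch of a BiLSTM sequence tagger of the kind the paper builds on; the residual connections and fastText embeddings of the proposed architecture are omitted, and all sizes and the random toy batch are assumptions.

```python
# BiLSTM tagger skeleton: embeddings -> BiLSTM -> per-token softmax.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len, num_tags = 5000, 100, 50, 9

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(num_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy batch: random word ids and NE tag ids, just to show the shapes;
# batch_size is one of the hyperparameters the paper varies.
X = np.random.randint(0, vocab_size, size=(32, max_len))
y = np.random.randint(0, num_tags, size=(32, max_len))
model.fit(X, y, epochs=1, batch_size=8)
```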
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP); NLG can generate natural language from a machine representation. Generating suggestions for a sentence, especially in Indian languages, is quite difficult. One of the major reasons is that they are morphologically rich and their word order is nearly the reverse of English. Using a deep learning approach with the help of Long Short Term Memory (LSTM) layers, we can generate a possible set of solutions for the erroneous part of a sentence. To effectively generate a set of sentences having a meaning equivalent to the original sentence using a Deep Learning (DL) approach, we train a model on this task, which requires thousands of example inputs and outputs. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
Exploiting rules for resolving ambiguity in marathi language texteSAT Journals
Abstract
Natural language ambiguity is a situation in which some words have multiple meanings/senses. This paper discusses natural language ambiguity and its types. Further, we propose a knowledge-based solution to resolve the various types of ambiguity occurring in Marathi language text. The task of resolving the semantic and lexical ambiguity of words to obtain the actual sense is referred to as Word Sense Disambiguation (WSD). Marathi is the official and commonly spoken language of the state of Maharashtra in India. Plenty of words in Marathi are spelled the same and uttered the same but are semantically (meaning-wise/sense-wise) different. During automatic translation, these words lead to ambiguity. Our method successfully removes the ambiguity by identifying the correct sense of the given text from the predefined possible senses available in Marathi Wordnet, using word and sentence rules. The method is applicable only to word-level ambiguity; structural ambiguity is not handled by this system. This system may be successfully used as a subsystem in other Natural Language Processing (NLP) applications.
Key Words: Word Sense Disambiguation, Natural Language Processing, Marathi, Marathi Wordnet, ambiguity, knowledge based
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
From the existing research it has been observed that many techniques and methodologies are available for performing every step of an Automatic Speech Recognition (ASR) system, but performance (minimization of the Word Error Rate, WER, and maximization of the Word Accuracy Rate, WAR) does not depend only on the technique applied in a given method. The research indicates that performance mainly depends on the category of the noise, the level of the noise, and the variable sizes of the window, frame, frame overlap, etc. considered in existing methods. The main aim of the work presented in this paper is to use variable parameter sizes, such as window size, frame size and frame overlap percentage, to observe the performance of algorithms for various categories and levels of noise, and also to train the system across all parameter sizes and categories of real-world noisy environments to improve the performance of the speech recognition system. This paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests with variable parameter sizes. It is observed that it is very hard to evaluate the test results and decide on a parameter size for improving and optimizing ASR performance. Hence, this study further suggests feasible and optimum parameter sizes using a Fuzzy Inference System (FIS) for enhancing the resultant accuracy in adverse real-world noisy environmental conditions. This work will be helpful for discriminative training of ubiquitous ASR systems for better Human Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, ubiquitous ASR system, Human Computer Interaction (HCI)
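A small sketch of the variable framing parameters the paper sweeps: frame length and overlap percentage are exposed as arguments so they can be varied across noise conditions. The sizes and the Hamming window choice are assumptions.

```python
# Split a speech signal into overlapping, windowed frames with
# adjustable frame length and overlap, as in a parameter sweep.
import numpy as np

def frame_signal(signal, frame_len=400, overlap=0.5):
    """Split a 1-D signal into Hamming-windowed, overlapping frames."""
    step = int(frame_len * (1.0 - overlap))
    window = np.hamming(frame_len)
    n_frames = max(1, 1 + (len(signal) - frame_len) // step)
    return np.stack([
        signal[i * step : i * step + frame_len] * window
        for i in range(n_frames)
    ])

# 1 second of fake 16 kHz audio; try several overlap percentages.
sig = np.random.randn(16000)
for ov in (0.25, 0.5, 0.75):
    print(ov, frame_signal(sig, frame_len=400, overlap=ov).shape)
```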
Phrase Structure Identification and Classification of Sentences using Deep Le...ijtsrd
Phrase structure is the arrangement of words in a specific order based on the constraints of a specified language. This arrangement is based on phrase structure rules which correspond to the productions of a context-free grammar. The phrase structure can be identified by breaking the natural language sentence into its constituents, which may be lexical and phrasal categories. These phrase structures can be identified by parsing the sentences, which is nothing but syntactic analysis. The proposed system deals with this problem using a deep learning strategy. Instead of a rule-based technique, supervised learning with sequence labelling is done using IOB labelling. This is a sequence classification problem which has been trained and modeled using an RNN with LSTM. The proposed work has shown considerable results and can be applied in many NLP applications. Hashi Haris | Misha Ravi, "Phrase Structure Identification and Classification of Sentences using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23841.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23841/phrase-structure-identification-and-classification-of-sentences-using-deep-learning/hashi-haris
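A short sketch of the IOB labelling scheme the paper uses for phrase structure: each token is tagged as Beginning, Inside, or Outside a phrase, which turns chunking into sequence classification. The sentence and tags are placeholders; the decoder below simply regroups tags into phrases.

```python
# Decode IOB tags back into labelled phrases (chunks).
sentence = ["the", "quick", "fox", "jumped", "over", "the", "fence"]
iob_tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]

def decode_chunks(tokens, tags):
    """Group IOB-tagged tokens back into labelled phrases."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-") or tag == "O":
            if current:                       # flush the finished chunk
                chunks.append((label, current))
            current, label = ([tok], tag[2:]) if tag != "O" else ([], None)
        else:                                 # I-*: continue current chunk
            current.append(tok)
    if current:
        chunks.append((label, current))
    return chunks

print(decode_chunks(sentence, iob_tags))
# [('NP', ['the', 'quick', 'fox']), ('VP', ['jumped']), ...]
```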
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
The volume of digitized text documents on the web has been increasing rapidly. As there is a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Clustering of documents is the collection of documents into groups such that documents within each group are similar to each other and not to documents of other groups. The quality of the clustering result depends greatly on the representation of the text and the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms. The bag-of-terms representation is often unsatisfactory, as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word; the bag-of-terms representation is thus enriched with synonyms from WordNet. The K-means, PSO and hybrid PSO+K-means algorithms are applied to clustering text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
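A minimal sketch of the synset-enrichment step followed by K-means; the PSO and hybrid PSO+K-means stages are omitted, and English WordNet stands in for the Nepali setting, since Nepali WordNet is not packaged with NLTK.

```python
# Enrich documents with WordNet synonyms, then cluster with K-means.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def expand_with_synonyms(text):
    tokens = text.split()
    extra = [l.name() for t in tokens for s in wn.synsets(t)
             for l in s.lemmas()]
    return " ".join(tokens + extra)

docs = ["the bank approved the loan", "the river bank eroded slowly"]
X = TfidfVectorizer().fit_transform(expand_with_synonyms(d) for d in docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```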
A prior case study of natural language processing on different domain IJECEIAES
In the present state of the digital world, computer machines do not understand humans' ordinary language. This is a great barrier between humans and digital systems. Hence, researchers have sought advanced technology that provides information to users from the digital machine. Natural language processing (NLP) is a branch of AI that has significant implications for the ways computer machines and humans can interact. NLP has become an essential technology in bridging the communication gap between humans and digital data. Thus, this study presents the necessity of NLP in the current computing world along with different approaches and their applications. It also highlights the key challenges in the development of new NLP models.
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
Named Entity Recognition and Classification (NERC) is the process of identifying proper nouns in text and classifying those nouns into certain predefined categories like person name, location, organization, date, time, etc. NERC in Kannada is an essential and challenging task. The aim of this work is to develop a novel model for NERC based on a Multinomial Naïve Bayes (MNB) classifier. The methodology adopted in this paper is based on feature extraction from the training corpus using term frequency and inverse document frequency, fitting them with a tf-idf vectorizer. The paper discusses the various issues in developing the proposed model. The details of implementation and performance evaluation are discussed. The experiments are conducted on a training corpus of 95,170 tokens and a test corpus of 5,000 tokens. It is observed that the model works with Precision, Recall and F1-measure of 83%, 79% and 81% respectively.
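A minimal sketch of the tf-idf-plus-MNB setup the abstract describes; the tokens, labels, and character n-gram features are illustrative assumptions, not the 95,170-token corpus.

```python
# Classify tokens into NE classes with a tf-idf vectorizer and MNB.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tokens = ["bengaluru", "ravi", "january", "hoda"]   # toy placeholder tokens
labels = ["LOC", "PER", "DATE", "O"]

# Character n-grams let the vectorizer pick up sub-word cues,
# which matters for a morphologically rich script like Kannada.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(tokens, labels)
print(model.predict(["mysuru"]))
```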
Author Credits - Maaz Anwar Nomani
A Semantic Role Labeler (SRL) is a semantic parser which can automatically identify and then classify the arguments of a verb in a natural language sentence, here for Hindi and Urdu. For example, in the natural language sentence "Sara won the competition because of her hard work.", 'won' is the main verb and there are 3 arguments of this verb: 'Sara' (Agent), 'hard work' (Reason) and 'competition' (Theme). The problem statement of an SRL revolves around how to make a machine identify and then classify the arguments of a verb in a natural language sentence.
Since there are 2 sub-problems here (identification and classification), our SRL has a pipeline architecture in which a binary classifier (Logistic Regression) is first trained to identify whether a word is an argument of a verb in a sentence or not (yes or no), and subsequently a multi-class classifier (SVM with a linear kernel) is trained to classify the arguments identified by the binary classifier into one of 20 classes. These 20 classes are the various notions present in a natural language sentence (for example Agent, Theme, Location, Time, Purpose, Reason, Cause, etc.). These 'notions' are called PropBank labels, or semantic labels, present in a Proposition Bank, which is a collection of hand-annotated sentences.
In essence, SRL facilitates semantic parsing, which essentially is the research investigation of identifying WHO did WHAT to WHOM, WHERE, HOW, WHY and WHEN, etc. in a natural language sentence.
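A minimal sketch of the two-stage pipeline described above: a Logistic Regression identifier followed by a linear-kernel SVM role classifier. The numeric feature vectors and the two role labels are toy stand-ins for real syntactic features and the full set of 20 PropBank labels.

```python
# Two-stage SRL pipeline: identify arguments, then classify their roles.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Stage 1: is this word an argument of the verb? (binary)
X_ident = np.array([[0.1, 1.0], [0.9, 0.2], [0.8, 0.1], [0.2, 0.9]])
y_ident = np.array([0, 1, 1, 0])
identifier = LogisticRegression().fit(X_ident, y_ident)

# Stage 2: which PropBank-style label does an argument carry?
X_cls = np.array([[0.9, 0.2], [0.8, 0.1]])
y_cls = np.array(["Agent", "Theme"])          # 2 of the 20 classes
classifier = SVC(kernel="linear").fit(X_cls, y_cls)

x_new = np.array([[0.85, 0.15]])
if identifier.predict(x_new)[0] == 1:         # argument detected
    print(classifier.predict(x_new))          # -> its semantic role
```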
Sentiment analysis is an important current research area, and the demand for sentiment analysis and classification is growing day by day. This paper presents a novel method to classify Urdu documents, as no previous work has been recorded on sentiment classification of Urdu text. We consider the problem of determining whether a review or sentence is positive, negative or neutral. For this purpose we use two machine learning methods, Naïve Bayes and Support Vector Machines (SVM). First the documents are preprocessed and the sentiment features are extracted; then the polarity is calculated, judged and classified through the machine learning methods.
Abstract: This paper presents a Semantic Analyzer for checking the semantic correctness of a given input text. We describe our system as one which analyzes the text by comparing it with the meanings of the words given in WordNet. The Semantic Analyzer thus developed not only detects and displays semantic errors in the text but also corrects them. Keywords: Part of Speech (POS) Tagger, Morphological Analyzer, Syntactic Analyzer, Semantic Analyzer, Natural Language (NL)
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Words can have more than one distinct meaning, and many words can be interpreted in multiple ways depending on the context in which they occur. The process of automatically identifying the meaning of a polysemous word in a sentence is a fundamental task in Natural Language Processing (NLP), and this phenomenon poses challenges to NLP systems. There have been many efforts on word sense disambiguation for English; however, comparatively little effort has been made for Amharic. Many natural language processing applications, such as Machine Translation, Information Retrieval, Question Answering, and Information Extraction, require this task, which occurs at the semantic level.
In this thesis, a knowledge-based word sense disambiguation method that employs Amharic WordNet is developed. Knowledge-based Amharic WSD extracts knowledge from word definitions and from relations among words and senses. The proposed system consists of preprocessing, morphological analysis and disambiguation components, besides the Amharic WordNet database. Preprocessing prepares the input sentence for morphological analysis, and morphological analysis reduces the various forms of a word to a single root or stem. Amharic WordNet contains words along with their different meanings, synsets and semantic relations within concepts. Finally, the disambiguation component identifies the ambiguous words and assigns the appropriate sense to each ambiguous word in a sentence using Amharic WordNet, based on sense overlap and related words.
We have evaluated the knowledge-based Amharic word sense disambiguation using the Amharic WordNet system by conducting two experiments. The first evaluates the effect of Amharic WordNet with and without a morphological analyzer, and the second determines an optimal window size for Amharic WSD. For Amharic WordNet with a morphological analyzer and without a morphological analyzer, we have achieved accuracies of 57.5% and 80%, respectively. In the second experiment, we found that a two-word window on each side of the ambiguous word is enough for Amharic WSD. The test results show that the proposed WSD methods perform better than previous Amharic WSD methods.
Keywords: Natural Language Processing, Amharic WordNet, Word Sense Disambiguation, Knowledge Based Approach, Lesk Algorithm
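The sense-overlap disambiguation described above is Lesk-style; as an illustration of the overlap idea (not the thesis' Amharic pipeline, since Amharic WordNet is not publicly packaged), here is NLTK's simplified Lesk over English WordNet, including the two-word context window the thesis found optimal.

```python
# Simplified Lesk: pick the sense whose gloss overlaps the context most.
# Requires: nltk.download('wordnet')
from nltk.wsd import lesk

context = "he cashed a cheque at the bank".split()
sense = lesk(context, "bank")                 # Lesk over the full sentence
print(sense, "-", sense.definition())

# A two-word window on each side of the ambiguous word, as the thesis
# found optimal for Amharic:
i = context.index("bank")
window = context[max(0, i - 2): i + 3]
print(lesk(window, "bank"))
```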
In this paper, we present a survey on Word Sense Disambiguation (WSD). In almost all major languages around the world, research in WSD has been conducted to varying extents. We survey the different approaches adopted in different research works, the state of the art in performance in this domain, recent work in different Indian languages, and finally work on the Bengali language. We also survey the different competitions in this field and the benchmark results obtained from those competitions.
Improvement wsd dictionary using annotated corpus and testing it with simplif...csandit
WSD is a task with a long history in computational linguistics and an open problem in NLP. This research focuses on increasing the accuracy of the Lesk algorithm with the assistance of an annotated corpus, the Narodowy Korpus Jezyka Polskiego (NKJP, "Polish National Corpus"). The NKJP_WSI (NKJP Word Sense Inventory) is used as the sense inventory. The Lesk algorithm is first run on the whole corpus (training and test) to obtain results, with the assistance of a special dictionary that contains all possible senses for each ambiguous word. In this implementation, the similarity equation from information retrieval is applied using tf-idf, with a small modification to meet the requirements. Experimental results show accuracies of 82.016% and 84.063% without and with deletion of stop words, respectively. Moreover, this paper practically addresses the challenge of execution time: we propose a special structure for building another dictionary from the corpus in order to reduce the time complexity of the training process. The new dictionary contains all the possible words (only those which help in solving WSD) with their tf-idf values, built from the existing dictionary with the assistance of the annotated corpus. Experimental results show that the two tests produce identical results, while the execution time of the second test dropped by a factor of 20 compared to the first, with the same accuracy.
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES ijnlc
Stemming is the process of term conflation: it conflates all the variants of a word to a common form called a stem. It plays a significant role in numerous Natural Language Processing (NLP) applications like morphological analysis, parsing, document summarization, text classification, part-of-speech tagging, question-answering systems, machine translation, word sense disambiguation, information retrieval (IR), etc. Each of these tasks requires some pre-processing, and stemming is one of the important building blocks for all of them. This paper presents an overview of various stemming techniques, evaluation criteria for stemmers, and various existing stemmers for Indic languages.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether identical words contained in different
documents refer to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
This paper presents a new model of WordNet that is used to disambiguate the correct sense of a polysemy
word based on clue words. The related words for each sense of a polysemy word, as well as for a single-sense
word, are referred to as clue words. The conventional WordNet organizes nouns, verbs, adjectives and
adverbs together into sets of synonyms called synsets, each expressing a different concept. In contrast to the
structure of WordNet, we developed a new model of WordNet that organizes the different senses of
polysemy words, as well as single-sense words, based on clue words. These clue words for each sense
of a polysemy word, as well as for a single-sense word, are used to disambiguate the correct meaning of the
polysemy word in the given context using knowledge-based Word Sense Disambiguation (WSD) algorithms.
A clue word can be a noun, verb, adjective or adverb.
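A minimal sketch of the clue-word idea, assuming a hand-built mapping from each sense of a polysemy word to a set of clue words; the senses, clue words, and example sentence below are invented for illustration, not taken from the proposed WordNet model.

CLUE_WORDS = {
    "bat/animal": {"cave", "fly", "wings", "nocturnal"},
    "bat/sports": {"cricket", "ball", "swing", "wooden"},
}

def disambiguate(sentence_tokens, clue_words):
    # return the sense whose clue words overlap most with the context
    context = set(sentence_tokens)
    overlap = {sense: len(context & clues)
               for sense, clues in clue_words.items()}
    return max(overlap, key=overlap.get)

print(disambiguate("the bat swung at the cricket ball".split(), CLUE_WORDS))
# -> bat/sports (two clue words, "cricket" and "ball", match the context)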
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
Source and target word segmentation and alignment is a primary step in the statistical learning of transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese transliteration in the presence of multiple target dialects, we asked whether this would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected this syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language-modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method should apply equally well to most written Indic languages); our training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; our testing consisted of applying the same segmentation to a test English word, followed by applying the model and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
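For intuition, a syllable-like segmentation can be approximated with a simple consonant-vowel split. This heuristic (which treats 'y' as a consonant and ignores real phonology) is our own illustration, not the segmenter actually used in the work above.

import re

def syllable_like_units(word):
    # one or more consonants followed by a vowel group, or trailing consonants
    return re.findall(r"[^aeiou]*[aeiou]+|[^aeiou]+$", word.lower())

print(syllable_like_units("hyderabad"))  # ['hyde', 'ra', 'ba', 'd']
print(syllable_like_units("telugu"))     # ['te', 'lu', 'gu']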
A decision tree based word sense disambiguation system in manipuri languageacijjournal
This paper presents a first attempt at building a word sense disambiguation system for the Manipuri
language. The paper discusses related attempts made for Manipuri, followed by the proposed
plan. A database consisting of 650 Manipuri sentences was collected in the course of the study.
Conventional positional and context-based features are used to capture the sense of words that
have ambiguous and multiple senses. The proposed work is expected to predict the senses of
polysemous words with high accuracy with the help of suitable knowledge acquisition techniques. The
system produces an accuracy of 71.75%.
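To make this kind of feature design concrete, here is a sketch (with scikit-learn, and invented English stand-ins for the Manipuri data) of positional context-word features feeding a decision tree; the window size, sentences, and sense labels are all illustrative assumptions, not the paper's actual setup.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def context_features(tokens, target_index, window=2):
    # words at fixed positions around the ambiguous word
    feats = {}
    for offset in range(-window, window + 1):
        pos = target_index + offset
        if offset != 0 and 0 <= pos < len(tokens):
            feats[f"w{offset:+d}"] = tokens[pos]
    return feats

# toy training data: (sentence tokens, index of ambiguous word, sense label)
train = [
    ("he sat on the river bank".split(), 5, "bank/river"),
    ("the river bank was muddy".split(), 2, "bank/river"),
    ("she went to the savings bank".split(), 5, "bank/finance"),
    ("a savings bank opened here".split(), 2, "bank/finance"),
]
X = [context_features(toks, i) for toks, i, _ in train]
y = [sense for _, _, sense in train]

vec = DictVectorizer()
clf = DecisionTreeClassifier(random_state=0).fit(vec.fit_transform(X), y)

test = "they walked along the river bank".split()
print(clf.predict(vec.transform([context_features(test, 5)])))  # ['bank/river']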
A performance of svm with modified lesk approach for word sense disambiguatio...eSAT Journals
Abstract: WSD is a technique used to find the correct meaning of a given word in any human language. Every human language has the problem of word ambiguity. Finding the correct meaning of an ambiguous word is easy for a human, but for a machine it is a hard problem. A number of works have been done on WSD, but not enough in the Hindi language. Our objective is to train the system so that it can easily find the correct meaning of any ambiguous word in Hindi. For this purpose we use an existing technique, the modified Lesk approach, and feed its output to an SVM to get a better result, showing that the SVM performs better than the modified Lesk approach alone. In this paper we take nine ambiguous Hindi words and three different databases to show the results. Keywords: Support Vector Machine, NLP, Word Sense Disambiguation, Modified Lesk Approach, Comparison
Continuous bag of words cbow word2vec word embedding work .pdfdevangmittal4
The way the continuous bag of words (CBOW) word2vec embedding works is that it predicts the
probability of a word given a context. A context may be a single word or a group of words; for
simplicity, we take a single context word and try to predict a single target word (a minimal
numerical sketch is given at the end of this item).
The purpose of this exercise is to create a word embedding for the given data set.
data set text:
In linguistics word embeddings were discussed in the research area of distributional semantics. It
aims to quantify and categorize semantic similarities between linguistic items based on their
distributional properties in large samples of language data. The underlying idea that "a word is
characterized by the company it keeps" was popularized by Firth.
The technique of representing words as vectors has roots in the 1960s with the development of
the vector space model for information retrieval. Reducing the number of dimensions using
singular value decomposition then led to the introduction of latent semantic analysis in the late
1980s. In 2000, Bengio et al. provided in a series of papers the "Neural probabilistic language
models" to reduce the high dimensionality of word representations in contexts by "learning a
distributed representation for words" (Bengio et al., 2003). Word embeddings come in two different
styles, one in which words are expressed as vectors of co-occurring words, and another in which
words are expressed as vectors of the linguistic contexts in which the words occur; these different
styles are studied in (Lavelli et al., 2004). Roweis and Saul published in Science how to use
"locally linear embedding" (LLE) to discover representations of high dimensional data structures.
The area developed gradually and really took off after 2010, partly because important advances
had been made since then on the quality of vectors and the training speed of the model.
There are many branches and many research groups working on word embeddings. In 2013, a
team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train
vector space models faster than the previous approaches. Most new word embedding techniques
rely on a neural network architecture instead of more traditional n-gram models and unsupervised
learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that
possible meanings of a word are conflated into a single representation (a single vector in the
semantic space). Sense embeddings are a solution to this problem: individual meanings of words
are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and Proteins) for
bioinformatics applications have been proposed by Asgari and Mofrad. Named bio-vectors
(BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins
(amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be
widely used in applications of deep learning in proteomics and genomics.
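The sketch promised above: a toy numpy forward pass of CBOW with a single context word, in which the context word's embedding row is projected to vocabulary scores and softmaxed into P(target | context). The vocabulary, dimensions, and random weights are arbitrary illustrative choices; training (not shown) would adjust both weight matrices by gradient descent on cross-entropy so that observed (context, target) pairs get high probability.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8                 # vocabulary size, embedding dimension
W_in = rng.normal(0, 0.1, (V, D))    # input (context) embeddings
W_out = rng.normal(0, 0.1, (D, V))   # output (target) weights

def predict(context_word):
    # P(target | context) for every word in the vocabulary
    h = W_in[vocab.index(context_word)]   # hidden layer = context embedding
    scores = h @ W_out                    # one score per vocabulary word
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

probs = predict("cat")
for word, p in zip(vocab, probs):
    print(f"P({word} | cat) = {p:.3f}")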
NBLex: emotion prediction in Kannada-English code-switchtext using naïve baye...IJECEIAES
Emotion analysis is the process of identifying human emotions from various data sources. Emotions can be expressed either in monolingual text or in code-switch text. Emotion prediction can be performed through machine learning (ML), deep learning (DL), or a lexicon-based approach. ML and DL approaches are computationally expensive and require training data, whereas the lexicon-based approach does not require any training data and takes much less time to predict emotions in comparison with ML and DL. In this paper, we propose a lexicon-based method called non-binding lower extremity exoskeleton (NBLex) to predict the emotions associated with Kannada-English code-switch text, which no one has addressed till now. We apply the One-vs-Rest approach to generate the scores for the lexicon and also to predict the emotions from the code-switch text. The accuracy of the proposed model NBLex (87.9%) is better than naïve Bayes (NB) (85.8%) and a bidirectional long short-term memory neural network (BiLSTM) (84.7%), and for true positive rate (TPR), NBLex (50.6%) is better than NB (37.0%) and BiLSTM (42.2%). From our approach, it is observed that a simple additive model (the lexicon approach) can also be an alternative model for predicting emotions in code-switch text.
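For intuition, a lexicon-based predictor of this kind can be as simple as the sketch below: each token adds its lexicon weights to the per-emotion scores, and the top-scoring emotion wins. The lexicon entries and weights here are invented; the real NBLex lexicon and its one-vs-rest scoring are built from Kannada-English code-switch data.

LEXICON = {
    "happy":  {"joy": 0.9, "sadness": 0.0, "anger": 0.0},
    "khushi": {"joy": 0.8, "sadness": 0.0, "anger": 0.1},  # code-switch token
    "cry":    {"joy": 0.0, "sadness": 0.9, "anger": 0.2},
}

def predict_emotion(tokens):
    # simple additive model: sum lexicon weights per emotion, take the max
    scores = {"joy": 0.0, "sadness": 0.0, "anger": 0.0}
    for tok in tokens:
        for emotion, weight in LEXICON.get(tok, {}).items():
            scores[emotion] += weight
    return max(scores, key=scores.get)

print(predict_emotion("i am so happy khushi today".split()))  # -> joy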
This paper proposes a natural-language discourse analysis method for extracting
information from news articles of different domains. The discourse analysis uses Rhetorical Structure
Theory (RST), which finds the coherent groups of text that are most prominent for extracting information
from a text. RST uses the nucleus-satellite concept for finding the most prominent text in a text
document. After the discourse analysis, text analysis is carried out to extract domain-related objects
and to relate these objects. For extracting the information, a knowledge-based system is used which
consists of a domain dictionary; the domain dictionary holds a bag of words for each domain. The system is
evaluated against a gold-standard analysis and human judgment of the extracted information.
Automatic classification of bengali sentences based on sense definitions present in bengali wordnet
International Journal of Control Theory and Computer Modeling (IJCTCM) Vol.5, No.1, January 2015
DOI: 10.5121/ijctcm.2015.5101
AUTOMATIC CLASSIFICATION OF BENGALI SENTENCES BASED ON SENSE DEFINITIONS PRESENT IN BENGALI WORDNET
Alok Ranjan Pal1, Diganta Saha2 and Niladri Sekhar Dash3
1 Dept. of Computer Science and Eng., College of Engineering and Management, Kolaghat
2 Dept. of Computer Science and Eng., Jadavpur University, Kolkata
3 Linguistic Research Unit, Indian Statistical Institute, Kolkata
ABSTRACT
Based on the sense definition of words available in the Bengali WordNet, an attempt is made to classify the
Bengali sentences automatically into different groups in accordance with their underlying senses. The input
sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL
project of the Govt. of India, while information about the different senses of a particular ambiguous lexical
item is collected from the Bengali WordNet. On an experimental basis we have used the Naive Bayes probabilistic
model as a classifier of sentences. We have applied the algorithm over 1747 sentences that contain a
particular Bengali lexical item which, because of its ambiguous nature, is able to trigger different senses
that render the sentences in different meanings. In our experiment we have achieved around 84% accuracy
in sense classification over the total input sentences. We have analyzed the residual sentences
that did not comply with our experiment and affected the results, noting that in many cases wrong
syntactic structures and sparse semantic information are the main hurdles in the semantic classification of
sentences. The applicational relevance of this study is attested in automatic text classification, machine
learning, information extraction, and word sense disambiguation.
KEYWORDS
Natural Language Processing, Bengali Word Sense Disambiguation, Bengali WordNet, Naïve Bayes
Classification.
1. INTRODUCTION
In all natural languages, there are a lot of words that denote different meanings based on the
contexts of their use within texts. Since it is not easy to capture the actual intended meaning of a
word in a piece of text, we need to apply Word Sense Disambiguation (WSD) [1-6] technique for
identification of actual meaning of a word based on its distinct contextual environments. For
example in English, the word ‘goal’ may denote several senses based on its use in different types
of construction, such as He scored a goal, It was his goal in life, etc. Such words with multiple
meanings are ambiguous in nature, and they pose serious challenges to understanding a natural
language text both by man and machine.
The act of identifying the most appropriate sense of an ambiguous word in a particular syntactic
context is known as WSD. A normal human being, due to her innate linguistic competence, is
able to capture the actual contextual sense of an ambiguous word within a specific syntactic
frame with the knowledgebase triggered from various intra- and extra-linguistic environments.
Since a machine does not possess such capacities and competence, it requires some predefined
rules or statistical methods to do this job successfully.
Normally, two types of learning procedure are used for WSD. The first one is Supervised
Learning, where a learning set is considered for the system to predict the actual meaning of an
ambiguous word within a syntactic frame in which the specific meaning for that particular word
is embedded. The system tries to capture contextual meaning of the ambiguous word based on
that defined learning set. The other one is Unsupervised Learning where dictionary information
(i.e., glosses) of the ambiguous word is used to do the same task. In most cases, since digital
dictionaries with information of possible sense range of words are not available, the system
depends on on-line dictionaries like WordNet [7-13] or SenseNet.
Adopting the technique used in Unsupervised Learning, we have used the Naive Bayes [14]
probabilistic measure to mark the sentence structure. Besides, we have a Bengali WordNet and a
standard Bengali dictionary to capture the actual sense a word generates in a normal Bengali
sentence.
The organization of the paper is as follows: in Section 2, we present a short review of some
earlier works; in Section 3, we refer to the key features of Bengali morphology with reference to
English; in Section 4, we present an overview of English and Bengali WordNet; in Section 5, we
refer to the Bengali corpus we have used for our study; in Section 6, we explain the approach we
have adopted for our work; in Section 7, we present the results and corresponding explanations; in
Section 8, we present some close observations on our study; and in Section 9, we draw conclusions
and point towards future directions of this research.
2. REVIEW OF EARLIER WORKS
WSD is perhaps one of the greatest open problems at the lexical level of Natural Language
Processing (Resnik and Yarowsky 1997). Several approaches have been established in different
languages for assigning the correct sense to an ambiguous word in a particular context (Gaizauskas
1997, Ide & Véronis 1998, Cucerzan, Schafer & Yarowsky 2002). Along with English, work has
been done in many other languages like Dutch, Italian, Spanish, French, German, Japanese,
Chinese, etc. (Xiaojie & Matsumoto 2003, Cañas, Valerio, Lalinde-Pulido, Carvalho & Arguedas
2003, Seo, Chung, Rim, Myaeng & Kim 2004, Liu, Scheuermann, Li & Zhu 2007, Kolte &
Bhirud 2008, Navigli 2009, Nameh, Fakhrahmad & Jahromi 2011). In most cases, a high level of
accuracy has been achieved in these works.
For Indian languages like Hindi, Bengali, Marathi, Tamil, Telugu, Malayalam, etc., efforts at
developing WSD systems have not been very successful, for several reasons. One of the reasons
is the morphological complexity of the words of these languages. Words are morphologically so
complex that there is no benchmark work in these languages (especially in Bengali). Keeping this
reality in mind we have made an attempt to disambiguate word sense in Bengali. We believe this
attempt will lead us to the destination through the tricky terrains of trial and error.
In essence, any WSD system typically involves two major tasks: (a) determining the different
possible senses of an ambiguous word, and (b) assigning the word with its most appropriate sense
in a particular context where it is used. The first task needs a Machine Readable Dictionary
(MRD) to determine the different possible senses of an ambiguous word. At this moment, the
most important sense repository used by the NLP community is the WordNet, which is being
developed for all major languages of the world for language-specific WSD tasks as well as for
other linguistic work. The second task involves assigning each polysemic word its
appropriate sense in a particular context.
The WSD procedures so far used across languages may be classified into two broad types: (i)
knowledge-based methods, and (ii) corpus-based methods. The knowledge-based methods obtain
information from external knowledge sources, such as, Machine Readable Dictionaries (MRDs)
and lexico-semantic ontologies. On the contrary, corpus-based methods gather information from
the contexts of previously annotated instances (examples) of words. These methods extract
knowledge from the examples applying some statistical or machine learning algorithms. When
the examples are previously hand-tagged, the methods are called supervised learning and when
the examples do not come with the sense labels they are called unsupervised learning.
2.1 Knowledge-based Methods
These methods do not depend on the large amount of training material required by supervised
methods. Knowledge-based methods can be classified further according to the type of resources
they use: Machine-Readable Dictionaries (Lesk 1986); Thesauri (Yarowsky 1992);
Computational Lexicon or Lexical Knowledgebase (Miller et al. 1990).
2.2 Corpus-based Methods
The corpus-based methods resolve the sense through a classification model built from example
sentences. These methods involve two phases: learning and classification. The learning phase
builds a sense classification model from the training examples, and the classification phase applies
this model to new instances (examples) to find the sense.
2.3 Methods Based on Probabilistic Models
In recent times, we have come across cases where various statistics-based probabilistic models
are being used to carry out the same task. The statistical methods evaluate a set of probabilistic
parameters that express the conditional probability of each lexical category in a particular
context. These parameters are then combined in order to assign the set of categories that
maximizes the probability on new examples.
The Naive Bayes algorithm (Duda and Hart 1973) is the most widely used algorithm in this category;
it uses the Bayes rule to find the conditional probabilities of features in a given class. It
has been used in many investigations of the WSD task (Gale et al. 1992; Leacock et al. 1993;
Pedersen and Bruce 1997; Escudero et al. 2000; Yuret 2004).
In addition to these, there are also some other methods used in different languages for the
WSD task, such as methods based on the similarity of examples (Schutze 1992), the k-Nearest
Neighbour algorithm (Ng and Lee 1996), methods based on discursive properties (Gale et al.
1992; Yarowsky 1995), and methods based on discriminating rules (Rivest 1987).
3. KEY FEATURES OF BENGALI MORPHOLOGY
In English, compared to Indic languages, most words have a limited number of morphologically derived
variants. Due to this factor it is comparatively easier to work on WSD in English, as it does not
pose serious problems in dealing with varied forms. For instance, the verb eat in English has five
conjugated (morphologically derived) forms only, namely, eat, eats, ate, eaten, and eating. On
the other hand, most of the Indian languages (e.g., Hindi, Bengali, Odia, Konkani, Gujarati,
Marathi, Punjabi, Tamil, Telugu, Kannada, Malayalam, etc.) are morphologically very rich,
varied and productive. As a result, we can derive more than a hundred conjugated verb
forms from a single verb root. For instance, the Bengali verb khāoyā "to eat" has more
than 150 conjugated forms, including both calit (colloquial) and sādhu (chaste) forms, such as
khai, khās, khāo, khāy, khān, khācchi, khācchis, khāccha, khācchen, khācche, khāitechi,
kheyechi, kheyecha, kheyechis, kheyeche, kheyechen, khelam, kheli, khele, khela, khelen,
khāba, khābi, khābe, khāben, khācchilām, khācchile, khācchila, khācchilen, khācchili, etc.
(to mention a few).
While nominal and adjectival morphology in Bengali is light (in the sense that the number of
forms derived from an adjective or a noun, although quite large, is not up to the range of forms
derived from a verb), the verbs are highly inflected. In general, nouns are inflected according to
seven grammatical cases (nominative, accusative, instrumental, ablative, genitive, locative, and
vocative), two numbers (singular and plural), a few determiners like -ṭā, -ṭi, -khānā, and -khāni,
and a few emphatic markers like -i and -o. The adjectives, on
the other hand, are normally inflected with some primary and secondary adjectival suffixes
denoting degree, quality, quantity, and similar other attributes. As a result, building a complete
and robust WSD system covering all types of morphologically derived forms, tagged with lexical
information and semantic relations, is a real challenge for a language like Bengali [15-25].
4. ENGLISH AND BENGALI WORDNET
The WordNet is a digital lexical resource, which organizes lexical information in terms of word
meanings. It is a system for bringing together different lexical and semantic relations between
words. In a language, a word may appear in more than one grammatical category and within that
grammatical category it can have multiple senses. These categories and all senses are captured in
the WordNet. WordNet supports the major grammatical categories, namely, Noun, Verb,
Adjective, and Adverb. All words which express the same sense (same meaning) are grouped
together to form a single entry in WordNet, called a Synset (set of synonyms). Synsets are the basic
building blocks of WordNet; each synset represents just one lexical concept. WordNet is
developed to remove ambiguity in cases where a single word denotes more than one sense.
The English WordNet [26] available at present contains a large list of synsets in which there are
117,097 unique nouns, 11,488 verbs, 22,141 adjectives, and 4,601 adverbs (Miller, Beckwith,
Fellbaum, Gross, and Miller 1990; Miller 1993). The semantic relations for each grammatical
category, as maintained in this WordNet, may be understood from the following diagrams (Fig. 1
and Fig. 2):
Figure 1. Noun Relations in English WordNet
Figure 2. Verb relations in English WordNet
The Bengali WordNet [27] is also a similar type of digital lexical resource, which aims at
providing mostly semantic information for general conceptualization, machine learning and
knowledge representation in Bengali (Dash 2012). It provides information about Bengali words
from different angles and also gives the relationship(s) existing between words. The Bengali
WordNet is being developed using expansion approach with the help of tools provided by Indian
Institute of Technology (IIT) Bombay. In this WordNet, a user can search for a Bengali word and
get its meaning. In addition, it gives the grammatical category namely, noun, verb, adjective or
adverb of the word being searched. It is noted that a word may appear in more than one
grammatical category and a particular grammatical category can have multiple senses. The
WordNet also provides information for these categories and all senses for the word being
searched.
Apart from the category for each sense, the following set of information for a Bengali word is
presented in the WordNet:
(a) Meaning of the word,
(b) Example of use of the word,
(c) Synonyms (words with similar meanings),
(d) Part-of-speech,
(e) Ontology (hierarchical semantic representation),
(f) Semantic and lexical relations.
At present the Bengali WordNet contains 36534 words covering all major lexical categories,
namely, noun, verb, adjective, and adverb.
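To make the shape of such an entry concrete, here is a sketch of how one sense record with the fields (a)-(f) above might be represented in code; the class name and all field values are illustrative placeholders, not actual Bengali WordNet content or its real data format.

from dataclasses import dataclass, field

@dataclass
class BengaliWordNetSense:
    word: str                        # the headword being searched
    part_of_speech: str              # (d) noun, verb, adjective, or adverb
    meaning: str                     # (a) meaning of the word
    example: str                     # (b) example of use of the word
    synonyms: list[str] = field(default_factory=list)   # (c) synset members
    ontology: list[str] = field(default_factory=list)   # (e) hierarchical path
    relations: dict[str, list[str]] = field(default_factory=dict)  # (f)

entry = BengaliWordNetSense(
    word="māthā",
    part_of_speech="noun",
    meaning="(an illustrative gloss: the upper part of the body)",
    example="(an illustrative Bengali example sentence)",
    synonyms=["(synonym 1)", "(synonym 2)"],
    ontology=["entity", "object", "body part"],
    relations={"hypernym": ["body part"]},
)
print(entry.part_of_speech, entry.synonyms)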
5. THE BENGALI CORPUS
The Bengali corpus that is used in this work is developed under the TDIL (Technology
Development for the Indian Languages) project, Govt. of India (Dash 2007). This corpus contains
text samples from 85 text categories or subject domains like Physics, Chemistry, Mathematics,
Agriculture, Botany, Child Literature, Mass Media, etc. (Table 1) covering 11,300 A4 pages,
271102 sentences and 3589220 non-tokenized words in their inflected and non-inflected forms.
Among these total words there are 199245 tokens (i.e., distinct words) each of which appears in
the corpus with a different frequency of occurrence. For example, the word māthā
"head" occurs 968 times, māthāy "on head" 729 times, and māthār "of head" 398 times,
followed by other inflected forms like māthāte "in head", māthāṭā "the head",
māthāṭi "the head", māthāgulo "heads", māthārā "heads", māthāder "to the heads",
and māthāri "of head itself" with moderate frequency. This
corpus is exhaustively used to extract sentences of a particular word required for our system as
well as for validating the senses evoked by the word used in the sentences.
6. PROPOSED APPROACH
In the proposed approach, we have used Naive Bayes probabilistic model to classify the sentences
based on some previously tagged learning sets. We have tested the efficiency of the algorithm
over the Bengali corpus data stated above. In this approach we have used a sequence of steps
(Fig. 3) to disambiguate the sense of māthā "head", one of the most common ambiguous words
in Bengali. The category-wise results are explained in the results and evaluation section.
6.1 Text annotation
At first, all the sentences containing the word māthā are extracted from the Bengali text corpus
(Section 5). The total sentence count is 1747; that is, there are at least 1747 sentences
in this particular corpus where the word māthā has been used in its non-inflected and non-
compounded lemma form. However, since the sentences extracted from the corpus are not
adequately normalized, they are passed through a series of manual normalization steps for (a)
separation or detachment of punctuation marks like single quotes, double quotes, parentheses,
commas, etc. that are attached to words; (b) conversion of dissimilar fonts into a uniform one; (c)
removal of angular brackets, uneven spaces, broken lines, slashes, etc. from sentences; and (d)
identification of sentence terminal markers (i.e., the full stop, note of exclamation, and note of
interrogation) used in written Bengali texts.
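As an illustration, steps (a), (c), and (d) can be approximated with a few regular expressions; the rules below are simplified stand-ins for the authors' manual normalization, and the Bengali danda (।) is included on the assumption that it marks the full stop in the corpus.

import re

def normalize(sentence):
    # (a) detach punctuation marks that are glued to words
    sentence = re.sub(r'([()"\',;])', r' \1 ', sentence)
    # (c) remove angular brackets and slashes, collapse uneven spaces
    sentence = re.sub(r'[<>/\\]', ' ', sentence)
    return re.sub(r'\s+', ' ', sentence).strip()

def split_sentences(text):
    # (d) sentence terminal markers: the danda (।), '!' and '?'
    return [s.strip() for s in re.split(r'[।!?]', text) if s.strip()]

print(normalize('he said,"go away" (quickly)'))
# -> he said , " go away " ( quickly )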
6.2 Stop word removal
The very next stage of our strategy was the removal of stop words. Based on traditional definitions
and arguments, we have identified as stop words in Bengali all postpositions, e.g., dike "towards",
prati "per", etc.; conjunctions, e.g., ebang "and", kintu "but", etc.; interjections, e.g.,
bāh "well!", āhā "ah!", etc.; pronouns, e.g., āmi "I", tumi "you", se "she", etc.;
some adjectives, e.g., lāl "red", bhālo "good", etc.; some adverbs, e.g., khub "very",
satyi "really", etc.; all articles, e.g., ekṭi "one", etc.; and proper nouns, e.g.,
rām "Ram", kalkātā "Calcutta", etc.
Our first step in identifying stop words was to measure the frequency of use of individual words,
which, we assumed, would help us single them out. However, since the term frequencies of
stop words were either very high or very low, it was not possible to set a particular threshold
value for filtering them out. So, in our case, stop words are manually tracked and tagged
with the help of a standard Bengali dictionary.
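A minimal sketch of this dictionary-driven filtering, using a tiny transliterated subset of the stop list above (the real list is far larger and is kept in Bengali script):

STOP_WORDS = {
    "dike", "prati",        # postpositions
    "ebang", "kintu",       # conjunctions
    "ami", "tumi", "se",    # pronouns
    "khub", "satyi",        # adverbs
    "ekti",                 # articles
}

def remove_stop_words(tokens):
    # keep only the tokens that are not in the manually built stop list
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("ami khub bhalo achhi".split()))  # -> ['bhalo', 'achhi']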
6.3 Learning Procedure
As the proposed approach is based on a supervised learning methodology, it is necessary to build
up a strong learning set before testing a new data set. In our approach, we have therefore used
three types of learning sets, built up according to three different meanings of the
ambiguous word māthā "head" as collected from the Bengali WordNet. There are
five types of dictionary definitions (glosses) for the word, with twenty-five (25) varied senses in the
WordNet, such as the following:
i.-v. Category: Noun. For each of the five glosses, the Synonyms, Concept, and Example are given
in Bengali script in the original.
It is observed from the categories that the first category represents the first type of meaning, the
2nd and 3rd categories represent the second type of meaning, and the 4th and 5th categories
represent the third type of meaning. Based on these information types we have built up three
specific categories of senses of māthā. After this we have randomly chosen 55 sentences for each
type of sense from the corpus to build the learning sets.
6.4 Modular representation of the proposed approach
In this proposed approach all the sentences containing māthā in the Bengali corpus are
classified into three pre-defined categories using the Naive Bayes (NB) classifier in the following
manner (Fig. 3):
Figure 3. Overall procedure represented graphically: the 1747 un-annotated sentences containing
māthā are drawn from the Bengali corpus of 271,102 sentences, annotated, passed through Module 1
(building the NB model) and Module 2 (classifying a test document), and classified into three
pre-defined categories.
6.4.1 Explanation of Module 1: Building NB model
In the NB model the following parameters are calculated based on the training documents:
• |V| = the vocabulary size, i.e., the total number of distinct words belonging to all
the training sentences.
• P(ci) = the prior probability of each class, i.e., the number of sentences in the class
divided by the total number of sentences.
• ni = the total word count of each class.
• P(wi | ci) = the conditional probability of a keyword occurrence in a given class.
To avoid the "zero frequency" problem, we have applied Laplace estimation by assuming a uniform
distribution over all words, as:
P(wi | ci) = (number of occurrences of the word wi in class ci + 1) / (ni + |V|)
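A runnable sketch of Module 1 under these definitions; the function and variable names are our own, and the model is stored as a plain dictionary for illustration.

from collections import Counter, defaultdict

def train_nb(training_sentences):
    # training_sentences: list of (tokens, class_label) pairs
    class_sentences = Counter()          # sentences per class, for P(ci)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocabulary = set()
    for tokens, label in training_sentences:
        class_sentences[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    total = sum(class_sentences.values())
    model = {"V": len(vocabulary), "classes": {}}
    for label in class_sentences:
        model["classes"][label] = {
            "prior": class_sentences[label] / total,       # P(ci)
            "n": sum(word_counts[label].values()),         # ni
            "counts": word_counts[label],
        }
    return model

def cond_prob(model, word, label):
    # Laplace-smoothed P(wi | ci) = (count(wi, ci) + 1) / (ni + |V|)
    cls = model["classes"][label]
    return (cls["counts"][word] + 1) / (cls["n"] + model["V"])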
6.4.2 Explanation of Module 2: Classifying a test document
To classify a test document W = (w1, w2, ..., wm), the "posterior" probability P(ci | W) for each
class is calculated as:
P(ci | W) = P(ci) × ∏j P(wj | ci)
The class with the highest probability is assigned to the test document.
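Continuing the sketch above (it assumes the train_nb and cond_prob functions just defined), Module 2 can be written as follows; the log-space sum is equivalent to the product in the formula and avoids numerical underflow, and the two-sentence training set is an invented toy example.

import math

def classify(model, tokens):
    # pick the class maximizing log P(ci) + sum_j log P(wj | ci)
    best_label, best_score = None, float("-inf")
    for label in model["classes"]:
        score = math.log(model["classes"][label]["prior"])
        for word in tokens:
            score += math.log(cond_prob(model, word, label))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_nb([
    ("pain in the head".split(), "body-part"),
    ("head of the office".split(), "chief"),
])
print(classify(model, "severe head pain".split()))  # -> body-part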
7. RESULTS AND CORRESPONDING EVALUATIONS
We performed the testing operation over the 271,102 sentences of the Bengali corpus. As mentioned in
Section 5, this corpus consists of 85 text categories of data sets like Agriculture, Botany,
Child Literature, etc., in which there are 1747 sentences containing the particular ambiguous word
māthā. After annotation (Section 6.1), each individual sentence is passed through the Naive
Bayes model and the "posterior" probability P(ci | W) for each sentence is evaluated. The greater
probability represents the associated sense for that particular sentence.
The category-wise result is furnished in Table 1. The performance of our system is measured by the
Precision, Recall and F-Measure parameters, stated as:
Precision (P) = number of instances responded to by the system / total number of sentences present
in the corpus containing māthā.
Recall (R) = number of instances matched with the human decision / total number of instances.
F-Measure (FM) = (2 × P × R) / (P + R).
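As a worked check of these formulas against the Agriculture row of Table 1 (13 sentences, all responded to, 11 matching the human decision):

def f_measure(p, r):
    return 2 * p * r / (p + r)

responded, total, matched = 13, 13, 11
P = responded / total      # Precision: the system responded to every sentence
R = matched / responded    # Recall: 11 of the 13 matched the human decision
print(round(P, 2), round(R, 2), round(f_measure(P, R), 2))  # 1.0 0.85 0.92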
Table 1. Performance analysis on the whole corpus (S1, S2, S3 are the three sense categories of
Section 6.3, labelled in Bengali in the original; R = right, W = wrong).

Category       Total  S1-R  S1-W  S2-R  S2-W  S3-R  S3-W  Right  Wrong  P  R     FM
Accountancy        0     0     0     0     0     0     0      0      0  0  NA    NA
Agriculture       13     1     0     0     0    10     2     11      2  1  0.85  0.92
Anthropology      24    19     5     0     0     0     0     19      5  1  0.79  0.88
Astrology          6     4     0     2     0     0     0      6      0  1  1.00  1.00
Astronomy          3     1     1     1     0     0     0      2      1  1  0.67  0.80
Banking            1     0     0     0     0     0     1      0      1  1  0.00  0.00
Biography         50    24     3    17     0     6     0     47      3  1  0.94  0.97
Botany             3     2     0     0     0     1     0      3      0  1  1.00  1.00
Business Math      2     0     0     0     0     1     1      1      1  1  0.50  0.67
Child Lit         86    36     7    21     1    20     1     77      9  1  0.90  0.94
Criticism         21     5     0     7     2     5     2     17      4  1  0.81  0.89
Dancing            2     0     0     1     0     0     1      1      1  1  0.50  0.67
Drawing            3     1     1     1     0     0     0      2      1  1  0.67  0.80
Economics         16     0     0     3     0    12     1     15      1  1  0.94  0.97
Education          1     0     0     1     0     0     0      1      0  1  1.00  1.00
Essay             23     7     1    10     0     5     0     22      1  1  0.96  0.98
Folk               0     0     0     0     0     0     0      0      0  0  NA    NA
GameSport         60    34     4    13     0     9     0     56      4  1  0.93  0.97
GenSc             25     8     2     5     0    10     0     23      2  1  0.92  0.96
Geology            0     0     0     0     0     0     0      0      0  0  NA    NA
HistoryWar         2     1     0     1     0     0     0      2      0  1  1.00  1.00
[...] inflected and compounded forms such as māthāpichu "per head", māthāte "in head", etc. We
have achieved 100% accuracy in this regard, so the Precision of the output is 1. We have achieved
an overall 84% accuracy over the 1747 sentences. In most of the cases the output is satisfactory,
in the sense that it has rightly identified the intended sense. However, in certain cases, the
performance of the system is not up to the mark, as it failed to capture the actual sense of the
word used in a specific syntactic frame. We have looked into each distinct case of failure and
investigated the results closely to identify the pitfalls (Section 8).
8. A FEW CLOSE OBSERVATIONS
The following parameters mattered most for the output during the execution of the system:
• As the prior probability of each class (P(ci)) depends on the number of sentences in each
class, the probability is affected by a huge difference between the numbers of sentences in the
classes. To overcome this, it is wiser to keep the number of sentences in each category constant
(55 in our case).
• As the total word count of each class (ni) is unbounded, the conditional probability of
keyword occurrence in a given class is affected by the huge inequality between the total word
counts of the classes. But this incidence is not under anyone's control, because the input
sentences are taken from a real-life data set. For this reason, the results in a few cases come
out wrong.
• Use of dissimilar Bengali fonts has a very adverse impact on the output. Text data needs
to be rendered in a uniform font.
• In certain cases, irrelevant words used in a sentence have caused inconsistency in the final
calculation. Such sentence structures matter a lot for the accuracy of the results, as in the
following examples (two very long sentences, given in Bengali script in the original):
a) [a long Bengali sentence]
b) [a long Bengali sentence]
• Another major problem is just the opposite of the issue stated above. Here the short
length of a sentence (in terms of the number of words) is a hindrance to capturing the intended
meaning of the word, as in several very short Bengali examples in the corpus. In the case of
such sentences, after discarding the stop words, insufficient information remains in the
sentence to detect the actual meaning of the ambiguous word.
We have tracked and analyzed all 280 wrong outputs and noted that most of them have
occurred due to very long or very short syntactic constructions.
9. CONCLUSION AND FUTURE WORKS
In this paper, with a single lexical example, we have tried to show that our proposed approach can
disambiguate the sense of an ambiguous word. Except for a few cases, the result obtained from our
system is quite satisfactory and in line with our expectations. We argue that a stronger and properly
populated learning set would invariably yield better results. In future, we plan to disambiguate the
100 most frequently used ambiguous words of different parts-of-speech in Bengali.
Finally, as the target word used in our experiment is a noun, it is comparatively easy to handle its
inflections as the number of inflected forms is limited. In case of a verb, however, we may need a
stemmer for retrieving the stem or lemma from a conjugated verb form.
REFERENCES
[1] Ide, N., Véronis, J., (1998) “Word Sense Disambiguation: The State of the Art”, Computational
Linguistics, Vol. 24, No. 1, Pp. 1-40.
[2] Cucerzan, R.S., C. Schafer, and D. Yarowsky, (2002) “Combining classifiers for word sense
disambiguation”, Natural Language Engineering, Vol. 8, No. 4, Cambridge University Press, Pp. 327-
341.
[3] Nameh, M., S., Fakhrahmad, M., Jahromi, M.Z., (2011) “A New Approach to Word Sense
Disambiguation Based on Context Similarity”, Proceedings of the World Congress on Engineering,
Vol. I.
[4] Xiaojie, W., Matsumoto, Y., (2003) “Chinese word sense disambiguation by combining pseudo
training data”, Proceedings of The International Conference on Natural Language Processing and
Knowledge Engineering, Pp. 138-143.
[5] Navigli, R. (2009) “Word Sense Disambiguation: a Survey”, ACM Computing Surveys, Vol. 41,
No.2, ACM Press, Pp. 1-69.
[6] Gaizauskas, R., (1997) “Gold Standard Datasets for Evaluating Word Sense Disambiguation
Programs”, Computer Speech and Language, Vol. 12, No. 3, Special Issue on Evaluation of Speech
and Language Technology, Pp. 453-472.
[7] Seo, H., Chung, H., Rim, H., Myaeng, S. H., Kim, S., (2004) “Unsupervised word sense
disambiguation using WordNet relatives”, Computer Speech and Language, Vol. 18, No. 3, Pp. 253-
273.
[8] G. Miller, (1991) “WordNet: An on-line lexical database”, International Journal of Lexicography,
Vol.3, No. 4.
[9] Kolte, S.G., Bhirud, S.G., (2008) “Word Sense Disambiguation Using WordNet Domains”, First
International Conference on Digital Object Identifier, Pp. 1187-1191.
[10] Liu, Y., Scheuermann, P., Li, X., Zhu, X. (2007) “Using WordNet to Disambiguate Word Senses for
Text Classification”, Proceedings of the 7th International Conference on Computational Science,
Springer-Verlag, Pp. 781 - 789.
[11] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J., (1990) “WordNet An on-line
Lexical Database”, International Journal of Lexicography, 3(4): 235-244.
[12] Miller, G.A., (1993) “WordNet: A Lexical Database”, Comm. ACM, Vol. 38, No. 11, Pp. 39-41.
[13] Cañas, A.J., A. Valerio, J. Lalinde-Pulido, M. Carvalho, and M. Arguedas, (2003) “Using WordNet
for Word Sense Disambiguation to Support Concept Map Construction”, String Processing and
Information Retrieval, Pp. 350-359.
[14] http://en.wikipedia.org/wiki/Naive_bayes
[15] Dash, N.S., (2007) Indian scenario in language corpus generation. In: Dash, N.S., P. Dasgupta and P.
Sarkar (Eds.) Rainbow of Linguistics: Vol. I. Kolkata: T. Media Publication. Pp. 129-162.
[16] Dash, N.S., (1999) “Corpus oriented Bangla language processing”, Jadavpur Journal of Philosophy.
11(1): 1-28.
[17] Dash, N.S., (2000) “Bangla pronouns-a corpus based study”, Literary and Linguistic Computing.
15(4): 433-444.
[18] Dash, N.S., (2004) Language Corpora: Present Indian Need, Indian Statistical Institute, Kolkata,
http://www.elda.org/en/proj/scalla/SCALLA2004/dash.pdf
[19] Dash, N.S., (2005) Methods in Madness of Bengali Spelling: A Corpus-based Investigation, South
Asian Language Review, Vol. XV, No. 2.
[20] Dash, N.S., (2012), From KCIE to LDC-IL: Some Milestones in NLP Journey in Indian Multilingual
Panorama. Indian Linguistics. 73(1-4): 129-146.
[21] Dash, N.S. and B.B. Chaudhuri, (2001) “A corpus based study of the Bangla language”, Indian
Journal of Linguistics. 20: 19-40.
[22] Dash, N.S. and B.B. Chaudhuri, (2001) "Corpus-based empirical analysis of form, function and
frequency of characters used in Bangla", in Rayson, P., Wilson, A., McEnery, T., Hardie, A., and
Khoja, S. (eds.) Special issue of the Proceedings of the Corpus Linguistics 2001 Conference,
Lancaster: Lancaster University Press, UK. 13: 144-157.
[23] Dash, N.S. and B.B. Chaudhuri., (2002) Corpus generation and text processing, International Journal
of Dravidian Linguistics. 31(1): 25-44.
[24] Dash, N.S. and B.B. Chaudhuri, (2002) “Using Text Corpora for Understanding Polysemy in
Bangla”, Proceedings of the Language Engineering Conference (LEC'02) IEEE.
[25] Dolamic, L. and J. Savoy, (2010) “Comparative Study of Indexing and Search Strategies for the
Hindi, Marathi and Bengali Languages”, ACM Transactions on Asian Language Information
Processing, 9(3): 1-24.
[26] Jurafsky, D. and J.H. Martin, (2000) Speech and Language Processing, Pearson Education Asia,
ISBN 81-7808-594-1, p. 604.
[27] http://www.isical.ac.in/~lru/wordnetnew/
Authors
Alok Ranjan Pal has been working as an Assistant Professor in the Computer Science and
Engineering Department of the College of Engineering and Management, Kolaghat, since 2006.
He completed his Bachelor's and Master's degrees under WBUT. He is now working on
Natural Language Processing.
Dr. Diganta Saha is currently working as an Associate Professor in Department of Computer
Science & Engineering, Jadavpur University, Kolkata. His field of specialization is Machine
Translation/ Natural Language Processing/ Mobile Computing/ Pattern Classification.
Dr. Niladri Sekhar Dash is currently working as an Assistant Professor in Department of
Linguistics and Language Technology, Indian Statistical Institute, Kolkata. His field of
specialization is Corpus Linguistics, Natural Language Processing, Language Technology,
Word Sense Disambiguation, Computer Assisted Language Teaching, Machine Translation,
Computational Lexicography, Field Linguistics, Graded Vocabulary Generation, Applied
Linguistics, Educational Technology, Bengali Language and Linguistics.