As the web grows exponentially, it becomes very difficult to provide relevant information to information seekers. While searching for information on the web, users can easily get lost in rich hypertext. Existing techniques provide results that are not up to the mark. This paper focuses on a technique that helps offer more accurate results, especially in the case of homographs. A homograph is a word that shares its written form with other words but has a different meaning. The technique, which shows how word senses can play an important role in offering accurate search results, is described in the following sections. By adopting this technique, a user receives only relevant pages at the top of the search results.
Parameters Optimization for Improving ASR Performance in Adverse Real World N...Waqas Tariq
Existing research shows that many techniques and methodologies are available for performing every step of an Automatic Speech Recognition (ASR) system, but performance (minimization of Word Error Rate, WER, and maximization of Word Accuracy Rate, WAR) does not depend on the applied technique alone. The literature indicates that performance mainly depends on the category of noise, the noise level, and the variable sizes of the window, frame, frame overlap, etc. considered in existing methods. The main aim of the work presented in this paper is to vary parameters such as window size, frame size and frame overlap percentage in order to observe algorithm performance for various categories and levels of noise, and to train the system over all parameter sizes and categories of real-world noisy environments so as to improve the performance of the speech recognition system. The paper presents the results of Signal-to-Noise Ratio (SNR) and accuracy tests under these varying parameter sizes. It is observed that evaluating the test results and deciding on a parameter size for ASR performance improvement is very hard. Hence, this study further suggests feasible and optimal parameter sizes using a Fuzzy Inference System (FIS) for enhancing the resulting accuracy in adverse real-world noisy environmental conditions. This work will be helpful for the discriminative training of ubiquitous ASR systems for better Human-Computer Interaction (HCI). Keywords: ASR Performance, ASR Parameters Optimization, Multi-Environmental Training, Fuzzy Inference System for ASR, Ubiquitous ASR System, Human-Computer Interaction (HCI)
An Improved Approach for Word Ambiguity RemovalWaqas Tariq
Word ambiguity removal is the task of removing ambiguity from a word, i.e. identifying the correct sense of a word in ambiguous sentences. This paper describes a model that uses a Part-of-Speech tagger and three categories for word sense disambiguation (WSD). Improving interactions between users and computers is essential to Human-Computer Interaction; to this end, supervised and unsupervised methods are combined. The WSD algorithm is used to find the efficient and accurate sense of a word based on domain information. The accuracy of this work is evaluated with the aim of finding the most suitable domain of a word. Keywords: Human Computer Interaction, Supervised Training, Unsupervised Learning, Word Ambiguity, Word Sense Disambiguation
MODELLING OF INTELLIGENT AGENTS USING A–PROLOGijaia
Nowadays, research in artificial intelligence has grown widely in areas such as knowledge representation, goal-directed behaviour and knowledge reusability, all of them directly relevant to improving intelligent agents in computer games. In particular, we focus on the development of a novel algorithm that allows an agent to combine fundamental capabilities such as reasoning, learning and simulation. The algorithm combines a system of logical rules with a learning-based simulation mechanism that gives our agent an infallible mechanism for decision-making in the game "connect four". The logic system is developed in a modelling language known as Answer Set Programming or A–Prolog. This paradigm is the integration of two well-known languages, namely Prolog and ASP.
AN APPROACH TO WORD SENSE DISAMBIGUATION COMBINING MODIFIED LESK AND BAG-OF-W...cscpconf
In this paper we propose a technique for finding the meaning of words through Word Sense Disambiguation, using both supervised and unsupervised learning. Limited labelled information is the main flaw of the supervised approach. Our proposed approach overcomes this limitation by using a learning set that is dynamically enriched with new data. We introduce a mixed methodology combining a "Modified Lesk" approach with "Bag-of-Words" whose bags are enriched through learning methods.
Extractive Summarization with Very Deep Pretrained Language Modelgerogepatton
The recent development of generative pretrained language models has proven very successful on a wide range of NLP tasks, such as text classification, question answering and textual entailment. In this work, we present a two-phase encoder-decoder architecture based on Bidirectional Encoder Representations from Transformers (BERT) for the extractive summarization task. We evaluated our model with both automatic metrics and human annotators, and demonstrated that the architecture achieves results comparable to the state of the art on a large-scale corpus, CNN/Daily Mail. To the best of our knowledge, this is the first work that applies a BERT-based architecture to a text summarization task and achieves results comparable to the state of the art.
A survey on phrase structure learning methods for text classificationijnlc
Text classification is the task of automatically classifying text into one of a set of predefined categories. The problem has been widely studied in communities such as natural language processing, data mining and information retrieval. Text classification is an important constituent of many information management tasks, such as topic identification, spam filtering, email routing, language identification, genre classification and readability assessment. The performance of text classification improves notably when phrase patterns are used, as they help capture non-local behaviour. Phrase structure extraction is the first step towards phrase pattern identification. In this survey, a detailed study of phrase structure learning methods has been carried out. This will enable future work on several NLP tasks that use syntactic information from phrase structure, such as grammar checkers, question answering, information extraction, machine translation and text classification. The paper also provides different levels of classification and a detailed comparison of the phrase structure learning methods.
Functional magnetic resonance imaging-based brain decoding with visual semant...IJECEIAES
Patterns of brain activity can be used to identify what a person has in mind, and using functional magnetic resonance imaging (fMRI) for brain decoding is the most accepted method. However, the accuracy of fMRI-based brain decoders is still restricted by limited training samples. Brain decoders using fMRI typically rely on hand-designed features for label encoding, with models trained to predict these features for a particular label; moreover, it is unclear what kind of semantic features are suitable for deciphering neural activity patterns. In the current work, a new computational model is proposed for learning decoding labels that are consistent with fMRI activity responses. Experiments demonstrate the success of the proposed labelling in terms of decoding accuracy from brain activity patterns, compared with the conventional text-derived feature technique. In addition, the experimental studies present a multi-task training model that reduces the problems of limited training data. The multi-task learning model is therefore more efficient than modern computational methods, and decoding features can be obtained more easily.
The paper presents a k-means based semi-supervised clustering approach for recognizing and classifying P300 signals for a BCI speller system. P300 signals have proven to be the most suitable Event-Related Potential (ERP) signals for developing BCI systems. Due to the non-stationary nature of ERP signals, the wavelet transform is the best analysis tool for extracting informative features from P300 signals. The focus of the research is on semi-supervised clustering: supervised approaches need a large amount of labeled data for training, which is tedious to obtain, and hence work only for small labeled datasets, while unsupervised clustering works when no prior information is available, i.e. on totally unlabeled data, and thus leads to a low level of performance. The in-between solution is semi-supervised clustering, which uses a few labeled samples together with a large amount of unlabeled data, costing less effort and time. Previous authors have selected and defined ad hoc features and assumed the clusters for small datasets. This motivates us to propose a novel approach that discovers the features embedded in P300 (EEG) signals, using k-means based semi-supervised cluster classification with an ensemble SVM.
The purpose of this research is to develop the Multi-Criteria Group Decision Making (MCGDM) decision model into Interval Value Fuzzy Multi-Criteria Group Decision Making (IV-FMCGDM); the specific purpose is to construct an Adaptive Interval Value Fuzzy Analytic Hierarchy Process (AIV-FAHP) decision-making model that uses Triangular Fuzzy Numbers (TFN) and group decision aggregation functions based on Interval Value Geometric Mean Aggregation (IV-GMA). The novelty of the research is to study group decision making by improving the middle point of the Interval Value Triangular Fuzzy Number (IV-TFN). It provides more accurate modelling, better rating performance, and more effective linguistic representation. The research produced a new decision-making model and algorithm based on AIV-FAHP, used to measure the quality of e-learning.
The project re-implements the architecture of the paper Reasoning with Neural Tensor Networks for Knowledge Base Completion in the Torch framework, achieving similar accuracy with an elegant implementation in a modern language.
Below are some links for further details:
https://github.com/agarwal-shubham/Reasoning-Over-Knowledge-Base
http://darsh510.github.io/IREPROJ/
Sentiment classification aims to detect information such as opinions and explicit or implicit feelings expressed in text. Most existing approaches can detect either explicit or implicit expressions of sentiment in the text, but only separately. The proposed framework detects both implicit and explicit expressions present in meeting transcripts. It classifies positive, negative and neutral words, and also identifies the topic of a particular meeting transcript using fuzzy logic. This paper aims to add features for improving the classification method. The quality of the sentiment classification is improved using the proposed fuzzy logic framework, which includes features such as fuzzy rules and the Fuzzy C-means algorithm. The quality of the output is evaluated using precision, recall and F-measure, while the Fuzzy C-means clustering is measured in terms of purity and entropy. The dataset was validated using 10-fold cross-validation, observing a 95% confidence interval over the accuracy values. Finally, the proposed fuzzy logic method produced more than 85% accurate results with a much lower error rate than existing sentiment classification techniques.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Abstract: This paper presents a Semantic Analyzer for checking the semantic correctness of given input text. We describe our system as one that analyzes the text by comparing it with the meanings of the words given in WordNet. The Semantic Analyzer thus developed not only detects and displays semantic errors in the text but also corrects them. Keywords: Part of Speech (POS) Tagger, Morphological Analyzer, Syntactic Analyzer, Semantic Analyzer, Natural Language (NL)
The peer-reviewed International Journal of Engineering Inventions (IJEI) was started with a mission to encourage contributions to research in Science and Technology, and to encourage and motivate researchers in challenging areas of Sciences and Technology.
The article presents Part-of-Speech tagging for Nepali text using three Artificial Neural Network techniques. A novel algorithm for POS tagging is introduced. Features are extracted from the marginal probabilities of a Hidden Markov Model and supplied as an input vector for each word to three different ANN architectures, viz. a Radial Basis Function (RBF) network, a General Regression Neural Network (GRNN) and a feed-forward neural network. Two different annotated tag sets are constructed for training and testing purposes. Results from all three techniques are compared on both sets. The GRNN-based POS tagging technique is found to be better, as it produces accuracies of 100% and 98.32% on the training and testing sets respectively.
Generation of Question and Answer from Unstructured Document using Gaussian M...IJACEE IJACEE
Question Answering (QA) systems are one of the ever-growing applications of Natural Language Processing. The purpose of an automatic question and answer generation system is to generate all possible questions and their relevant answers from a given unstructured document. Complex sentences are simplified to make question generation easier. The accuracy of the generated questions is measured by identifying the subtopics in the text using a Gaussian Mixture Neural Topic Model (GMNTM). The similarity between generated questions and the text is calculated using the Extended String Subsequence Kernel (ESSK). The syntactic correctness of the questions is measured by a Syntactic Tree Kernel, which computes similarity scores between each sentence in the given context and the generated questions. Questions are ranked based on their similarity scores. The answers to the generated questions are extracted using a pattern matching approach. This system is expected to produce better accuracy than a system using Latent Dirichlet Allocation (LDA) for subtopic identification.
A hybrid composite features based sentence level sentiment analyzerIAESIJAI
Current lexicon and machine learning based sentiment analysis approaches still suffer from a two-fold limitation. First, manual lexicon construction and machine training is time-consuming and error-prone. Second, prediction accuracy requires that sentences and their corresponding training text fall under the same domain. In this article, we experimentally evaluate four sentiment classifiers, namely support vector machines (SVM), Naive Bayes (NB), logistic regression (LR) and random forest (RF). We quantify the quality of each of these models using three real-world datasets that comprise 50,000 movie reviews, 10,662 sentences, and 300 generic movie reviews. Specifically, we study the impact of a variety of natural language processing (NLP) pipelines on the quality of the predicted sentiment orientations. Additionally, we measure the impact of incorporating lexical semantic knowledge captured by WordNet on expanding the original words in sentences. Findings demonstrate that utilizing different NLP pipelines and semantic relationships impacts the quality of the sentiment analyzers. In particular, results indicate that coupling lemmatization with knowledge-based n-gram features produces higher accuracy. With this coupling, the accuracy of the SVM classifier improved to 90.43%, against 86.83%, 90.11% and 86.20% respectively for the three other classifiers.
This slide includes:
Types of Machine Learning
Supervised Learning
Brain
Neuron
Design a Learning System
Perspectives
Issues in Machine Learning
Learning Task
Learning as Search
Hypothesis
Version Spaces
Candidate elimination algorithm
Linear Discriminant
Perceptron
Linear Separability
Linear Regression
Unsupervised Learning
Reinforcement Learning
Evolutionary Learning
Machine Learning Techniques with Ontology for Subjective Answer Evaluationijnlc
Computerized evaluation of English essays is performed using machine learning techniques such as Latent Semantic Analysis (LSA), Generalized LSA, Bilingual Evaluation Understudy and Maximum Entropy. An ontology, a concept map of domain knowledge, can enhance the performance of these techniques. Use of an ontology makes the evaluation process holistic, as the presence of keywords, synonyms, the right word combinations and the coverage of concepts can all be checked. In this paper, the above-mentioned techniques are implemented both with and without an ontology and tested on common input data consisting of technical answers in Computer Science. A domain ontology of Computer Graphics is designed and developed. The software used for implementation includes the Java programming language and tools such as MATLAB, Protégé, etc. Ten questions from Computer Graphics, with sixty answers per question, are used for testing. The results are analyzed, and it is concluded that they are more accurate when the ontology is used.
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...ijtsrd
Natural Language Generation (NLG) is one of the major fields of Natural Language Processing (NLP); NLG can generate natural language from a machine representation. Generating suggestions for a sentence, especially in Indian languages, is very difficult; one major reason is that they are morphologically rich and their word order is roughly the reverse of English. Using a deep learning approach based on Long Short-Term Memory (LSTM) layers, we can generate a set of possible corrections for the erroneous part of a sentence. To effectively generate a set of sentences with meaning equivalent to the original sentence using a Deep Learning (DL) approach, a model must be trained on this task, i.e. with thousands of example inputs and outputs. Veena S Nair | Amina Beevi A, "Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 3, Issue 4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd23842.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/23842/suggestion-generation-for-specific-erroneous-part-in-a-sentence-using-deep-learning/veena-s-nair
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGcsandit
The proposed approach deals with the detection of jargon words in electronic data across communication media such as the internet and mobile services. In real life, however, jargon words are not always used in their complete word forms; most of the time they appear in abbreviated forms, such as sounds-alike forms, taboo morphemes, etc. The proposed approach detects these abbreviated forms as well, using a semi-supervised learning methodology that derives the probability of a suspicious word being a jargon word from synset and concept analysis of the text.
DYSLEXIC READING ASSISTANCE WITH LANGUAGE PROCESSING ALGORITHMSijcsit
The dyslexic reading challenge has not been completely resolved to date, even with advanced learning algorithms. Breaking down complex words and helping children remember them with hints is the need of the hour. We develop a reading assistant that breaks down complex words, uses hints to support backward and forward iterations of remembering, makes creative use of word clouds, and uses deep learning techniques to effectively tokenize words and assist struggling readers.
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
The article presents Sentiment Analysis (SA) and tense classification using a skip-gram model for word-to-vector encoding of the Nepali language. The SA experiment for positive-negative classification is carried out in two ways. In the first experiment, the vector representation of each sentence is generated using the skip-gram model followed by Multi-Layer Perceptron (MLP) classification, achieving an F1 score of 0.6486 for positive-negative classification with an overall accuracy of 68%. In the second experiment, verb chunks are extracted using a Nepali parser and the same experiment is carried out on the verb chunks, yielding an F1 score of 0.6779 with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentence-based sentiment analysis. The paper also proposes using the skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation is again generated by the skip-gram model followed by MLP classification, and verb chunks show a very low overall accuracy of 53%. The fourth experiment, tense classification using whole sentences, improves efficiency with an overall accuracy of 89%; past tenses are identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunk-based tense classification.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT...IJDKP
Many applications of automatic document classification require learning accurately with little training data. Semi-supervised classification techniques use both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial. On the other hand, the emergence of web technologies has led to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification. We use support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy increase of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...kevig
This study investigates the effectiveness of knowledge named entity recognition in Online Judges (OJs). OJs lack topic classification and are limited to problem IDs only, so a lot of time is consumed in finding programming problems, particularly by knowledge entity. A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied to recognize the knowledge named entities present in solution reports. For the test run, more than 2000 solution reports were crawled from the Online Judges and processed for model output. The stability of the model is also assessed via its high F1 value. The results obtained with the proposed BiLSTM-CRF model are more effective (F1: 98.96%) and efficient in lead time.
BIDIRECTIONAL LONG SHORT-TERM MEMORY (BILSTM)WITH CONDITIONAL RANDOM FIELDS (...ijnlc
This study investigates the effectiveness of Knowledge Named Entity Recognition in Online Judges (OJs). OJs are lacking in the classification of topics and limited to the IDs only. Therefore a lot of time is consumed in finding programming problems more specifically in knowledge entities.A Bidirectional Long Short-Term Memory (BiLSTM) with Conditional Random Fields (CRF) model is applied for the recognition of knowledge named entities existing in the solution reports.For the test run, more than 2000 solution reports are crawled from the Online Judges and processed for the model output. The stability of the model is
also assessed with the higher F1 value. The results obtained through the proposed BiLSTM-CRF model are more effectual (F1: 98.96%) and efficient in lead-time.
Derric A. Alkis C
Abstract:
Modern technology and machine learning are used to give customers a high degree of confidence and to give sellers more information about their products and about customer preferences, by analysing the comments left on a product and evaluating later additions, and thereby assessing whether the product is good or bad.
Myanmar news summarization using different word representations IJECEIAES
There is an enormous amount of information available from different sources and genres. In order to extract useful information from a massive amount of data, an automatic mechanism is required. Text summarization systems assist with content reduction, keeping the important information and filtering out the unimportant parts of the text. Good document representation is very important in text summarization for obtaining relevant information. Bag-of-words cannot capture word similarity at the syntactic and semantic level, whereas word embeddings give good document representations that capture and encode the semantic relations between words. Therefore, a centroid method based on word embedding representations is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized using a centroid-based word embedding summarizer, exploiting the effectiveness of the word embedding representation approach. Experiments were carried out on a Myanmar local and international news dataset using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embeddings performs comprehensively better than centroid summarization using bag-of-words.
Computing semantic similarity between two words can be approached in a variety of ways, and is essential for applications such as text analysis and text understanding. Traditionally, search engines are used to compute the similarity between words, but search engines are keyword-based, with the drawback that users must know exactly what they are looking for. There are two main approaches to the computation, namely knowledge-based and corpus-based approaches, but neither is suitable for computing the similarity between multi-word expressions. This system provides an efficient and effective approach for computing term similarity using a semantic network, with a clustering step used to improve the accuracy of the semantic similarity. The approach is more efficient than other algorithms and can also be applied to large-scale datasets to compute term similarity.
SENSE DISAMBIGUATION TECHNIQUE FOR PROVIDING MORE ACCURATE RESULTS IN WEB SEARCH

International Journal on Web Service Computing (IJWSC), Vol.3, No.3, September 2012
DOI: 10.5121/ijwsc.2012.3303

Rekha Jain1 and G. N. Purohit2
1 Department of Computer Science, Banasthali University, Rajasthan, India, rekha_leo2003@rediffmail.com
2 Department of Computer Science, Banasthali University, Rajasthan, India, gn_purohitjaipur@yahoo.co.in
ABSTRACT
As the web grows exponentially, it becomes very difficult to provide relevant information to information seekers. While searching for information on the web, users can easily get lost in rich hypertext. Existing techniques provide results that are not up to the mark. This paper focuses on a technique that helps offer more accurate results, especially in the case of homographs. A homograph is a word that shares its written form with other words but has a different meaning. The technique, which shows how word senses can play an important role in offering accurate search results, is described in the following sections. By adopting this technique, a user receives only relevant pages at the top of the search results.
KEYWORDS
Information Retrieval, Sense Disambiguation Technique, Homographs
1. INTRODUCTION
Sometimes a single word can have different senses. Such words are called polysemous words; e.g. "bass" can be a type of fish or a musical instrument. Word Sense Disambiguation is a process that selects a sense from a set of predefined word senses for an instance of a polysemous word in a particular context and assigns that sense to the word. The technique considers two properties of a word, polysemy and homonymy, which are well-known semantic problems. "Bank" in "river bank" and "Bank of England" is homonymous, while "river bed" and "hospital bed" describe the case of polysemy. Word Sense Disambiguation is useful for finding a semantic understanding of text. It is an important as well as challenging technique in the areas of NLP (Natural Language Processing), MT (Machine Translation), Semantic Mapping, IR (Information Retrieval), IE (Information Extraction), Speech Recognition, etc.

One of the problems with Information Retrieval (IR), in the case of homographs, is deciding the correct sense of a word, because dictionary-based word sense definitions are ambiguous. If trained linguists manually tag word senses, there is a chance that different annotators will assign different senses to the same word, so some technique is required to disambiguate a word automatically. Word knowledge is difficult to verbalize in dictionaries [1].
To disambiguate a polysemous word, two resources are necessary: 1) the context in which the word occurs, and 2) some kind of knowledge related to that word. There are four parts of speech that need disambiguation: nouns, verbs, adjectives and adverbs. This paper focuses on a technique that resolves the ambiguity of polysemous nouns.

The remainder of the paper is organized as follows: in section 2 we discuss various approaches for resolving the sense of a word. In section 3 some knowledge resources are introduced. Section 4 discusses the applicability of the Sense Disambiguation Technique, section 5 gives a brief overview of the problem, and our proposed approach is discussed in section 6. Section 7 provides the results of our algorithm and section 8 analyses them. Finally, conclusions and future work finish the article.
2. APPROACHES
Word Sense Disambiguation algorithms can be roughly classified into unsupervised and supervised approaches on the basis of their use of training corpora.
2.1. Unsupervised Approach
In this approach no training corpus is required, so it needs less time and computing power. It is mainly used in MT (Machine Translation) and IR (Information Retrieval), but it performs worse than the supervised approach because less knowledge is available to it. It has the following sub-approaches:
A. Simple Approach (SA): This refers to algorithms that consider only one type of lexical knowledge. The approach is easy to implement but does not achieve good precision and recall. Precision is the portion of correctly classified samples among all classified samples; recall is the portion of correctly classified samples among all samples [2, 3]. Generally the value of recall is less than the value of precision unless all the samples are tagged.
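For concreteness, a minimal sketch of the two measures (not from the paper), where `None` marks samples the tagger left unclassified:

```python
def precision_recall(gold, pred):
    """Precision/recall for a sense tagger that may leave samples untagged (None)."""
    classified = [(g, p) for g, p in zip(gold, pred) if p is not None]
    correct = sum(1 for g, p in classified if g == p)
    precision = correct / len(classified) if classified else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Example: 4 samples, 3 tagged, 2 tagged correctly.
gold = ["fish", "music", "fish", "music"]
pred = ["fish", "music", None, "fish"]
print(precision_recall(gold, pred))  # (0.666..., 0.5): recall trails precision
```

Because the denominator of recall counts the untagged samples too, recall can only equal precision once every sample is tagged, which is the point made above.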
B. Combination of Simple Approaches (CSA): This is a combination of simple approaches, created by simply summing the normalized weights of the individual simple approaches [4]. As multiple resources offer more confidence in a sense than a single resource does, it usually performs better than a single approach.
C. Iterative Approach (IA): This approach tags only the words for which there is high confidence, on the basis of information about sense-tagged words from previous steps and other lexical knowledge [5]. It disambiguates nouns with 55% precision and verbs with 92.2% precision.
D. Recursive Filtering (RF): This approach follows the same principle as IA, with some differences: it assumes that the correct sense of a target word has a stronger semantic relationship with the other words in the context than the remaining senses do, and it does not fix the sense of any word until the final step. The algorithm gradually removes irrelevant senses and leaves only the relevant ones within a finite number of cycles. It has been reported to achieve 68.79% precision and 68.80% recall [6].
E. Bootstrapping (BS): This approach follows a recursive optimization algorithm that requires a few seed values instead of a large number of training samples. It repeatedly applies the trained model to predict the senses of new cases and returns a model extended with the newly predicted cases. On a list of 12 words, this algorithm achieved 96.5% precision [7]. The approach truly achieves very high precision, but it is limited to disambiguating a few words from the text.
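The loop below is a minimal sketch of this idea under illustrative assumptions; it is not the exact algorithm of [7]. Starting from a few seed collocations per sense, it labels only the unlabeled contexts whose score margin is high, grows the collocation sets from those confident cases, and repeats:

```python
def bootstrap(seeds, unlabeled, rounds=5, min_margin=2):
    """Yarowsky-style bootstrapping sketch.

    seeds: dict sense -> set of seed collocation words.
    unlabeled: list of contexts, each a set of words.
    Returns a dict mapping context index -> predicted sense.
    """
    labeled = {}                                   # index -> sense, grows each round
    colloc = {s: set(ws) for s, ws in seeds.items()}
    for _ in range(rounds):
        for i, ctx in enumerate(unlabeled):
            if i in labeled:
                continue
            scores = {s: len(ctx & ws) for s, ws in colloc.items()}
            best = max(scores, key=scores.get)
            runner_up = max((v for s, v in scores.items() if s != best), default=0)
            if scores[best] - runner_up >= min_margin:   # only high-confidence cases
                labeled[i] = best
                colloc[best] |= ctx                      # learn new collocations
    return labeled

contexts = [{"caught", "river", "bank", "fishing"},
            {"played", "guitar", "loud", "music"},
            {"deep", "river", "fishing", "boat"}]
seeds = {"fish": {"river", "fishing"}, "music": {"guitar", "music"}}
print(bootstrap(seeds, contexts))  # {0: 'fish', 1: 'music', 2: 'fish'}
```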
2.2. Supervised Approach
This approach uses a model trained on sense-tagged corpora, which links world knowledge to word senses. Most recently developed WSD algorithms are supervised because of the availability of training corpora, but this does not mean that the unsupervised approach is obsolete. It has the following sub-approaches:
A. Log Linear Model (LLM): This model is based on the assumption that each feature is conditionally independent of the others. The probability of each sense is computed with Bayes' Rule [8]:
$$ p(s_i \mid c_1, \ldots, c_k) = \frac{p(c_1, \ldots, c_k \mid s_i)\, p(s_i)}{p(c_1, \ldots, c_k)} \qquad (1) $$
Because $p(c_1, \ldots, c_k)$ is the same for all senses of the target word, we can simply ignore it. According to the independence assumption:
$$ p(c_1, \ldots, c_k \mid s_i) = \prod_{j=1}^{k} p(c_j \mid s_i) \qquad (2) $$
$$ s = \mathop{\mathrm{ARGMAX}}_{s_i} \Big[ \log p(s_i) + \sum_{j=1}^{k} \log p(c_j \mid s_i) \Big] \qquad (3) $$
But this approach has two disadvantages: 1) the independence assumption does not clearly hold for real features, and 2) good techniques are needed to smooth the terms [9].
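As a hedged illustration of equations (1)-(3), a minimal Python sketch of this classifier with add-one smoothing (one simple answer to the second disadvantage); the training data are placeholders of our own:

import math
from collections import Counter

def train(tagged):
    # tagged: list of (sense, context word list) pairs
    sense_counts = Counter(s for s, _ in tagged)
    word_counts = {s: Counter() for s in sense_counts}
    for s, ctx in tagged:
        word_counts[s].update(ctx)
    vocab = {w for _, ctx in tagged for w in ctx}
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    # equation (3): argmax over senses of log p(s) + sum_j log p(c_j | s)
    n = sum(sense_counts.values())
    best, best_lp = None, -math.inf
    for s, c in sense_counts.items():
        lp = math.log(c / n)
        denom = sum(word_counts[s].values()) + len(vocab)  # add-one smoothing
        for w in context:
            lp += math.log((word_counts[s][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = s, lp
    return best

model = train([("fish", ["caught", "lake"]), ("music", ["guitar", "amp"])])
print(disambiguate(["lake", "caught"], *model))  # -> "fish"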
B. Decomposable Probabilistic Models (DPM): This model corrects the false independence assumption of the LLM by modelling the interdependence of features in the training data [10, 11]. It can achieve better results if the training data are large enough to estimate these interdependencies.
C. Memory Based Learning (MBL): This approach supports both numeric and symbolic features, so it can integrate various features into one model [12]. It classifies new cases by calculating a similarity metric as follows:
$$ \Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i) \qquad (4) $$

where

$$ \delta(x_i, y_i) = \frac{|x_i - y_i|}{\max_i - \min_i} $$

if the feature is numeric; otherwise

$$ \delta(x_i, y_i) = 1 \text{ if } x_i \neq y_i, \qquad \delta(x_i, y_i) = 0 \text{ if } x_i = y_i. $$
If there is no information about feature relevance, the feature weight is 1; otherwise a domain-knowledge bias is added to the weight.
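A hedged sketch of the metric in equation (4), assuming feature vectors of mixed numeric and symbolic values and unit weights unless domain knowledge says otherwise (the names are illustrative, not TiMBL's API [12]):

def delta(x, y, lo=None, hi=None):
    # scaled absolute difference for numeric features, overlap for symbolic ones
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / (hi - lo) if hi != lo else 0.0
    return 0.0 if x == y else 1.0

def distance(X, Y, weights=None, ranges=None):
    # equation (4): Delta(X, Y) = sum_i w_i * delta(x_i, y_i)
    n = len(X)
    weights = weights or [1.0] * n          # weight 1 without relevance information
    ranges = ranges or [(None, None)] * n   # (min_i, max_i) per numeric feature
    return sum(w * delta(x, y, *r) for w, x, y, r in zip(weights, X, Y, ranges))

# A new case is classified by its nearest stored case:
stored = [(("subject", 3), "fish"), (("object", 9), "music")]
new = ("subject", 4)
print(min(stored, key=lambda m: distance(new, m[0], ranges=[(None, None), (0, 10)]))[1])  # -> "fish"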
D. Maximum Entropy (ME): This is a constraint-based approach in which the algorithm maximizes the entropy of $p(y \mid x)$, the conditional probability of sense $y$ under facts $x$, given a collection of facts computed from the data [13, 14]:
$$ f_i(x, y) = 1 \text{ if sense } y \text{ occurs under condition } x, \text{ otherwise } f_i(x, y) = 0 $$

$$ p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big) \qquad (5) $$

where $Z(x)$ is a normalizing factor.
The parameters $\lambda_i$ can be computed by a numerical algorithm called Improved Iterative Scaling.
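A minimal sketch of equation (5) given already-estimated parameters $\lambda_i$; in practice the $\lambda_i$ would be fitted with Improved Iterative Scaling or a similar method, and the features and weights below are invented for illustration:

import math

def p_y_given_x(y, x, features, lambdas, senses):
    # equation (5): p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
    def score(sense):
        return math.exp(sum(l * f(x, sense) for f, l in zip(features, lambdas)))
    return score(y) / sum(score(s) for s in senses)  # Z(x) normalizes over senses

# Binary features f_i(x, y): 1 if sense y occurs under condition x, else 0.
features = [
    lambda x, y: 1.0 if "lake" in x and y == "fish" else 0.0,
    lambda x, y: 1.0 if "guitar" in x and y == "music" else 0.0,
]
lambdas = [1.5, 2.0]  # assumed values, as if already trained
print(p_y_given_x("fish", {"lake", "caught"}, features, lambdas, ["fish", "music"]))  # ~0.82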
E. Expectation Maximization (EM): This approach solves a maximization problem containing incomplete information by applying an iterative procedure. Incomplete information means that the contextual features are not directly associated with word senses. Expectation Maximization is a hill-climbing algorithm whose attainment of the global maximum depends on the initial values of the parameters [15], so care is needed when initializing them. EM does not require the corpus to be sense-tagged, as it can learn the conditional probability between hidden senses and aligned word pairs from bilingual corpora.
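The use cited above is bilingual; as a hedged, monolingual illustration of the E-step/M-step structure only, here is EM for a mixture of sense-conditional unigram models, where the sense of each context is the hidden variable:

import random
from collections import defaultdict

def em_senses(contexts, n_senses=2, iters=20, alpha=0.1, seed=0):
    # contexts: list of word lists; the sense of each context is hidden
    rng = random.Random(seed)
    vocab = sorted({w for ctx in contexts for w in ctx})
    senses = range(n_senses)
    p_s = [1.0 / n_senses] * n_senses
    # random initialization breaks symmetry; EM only climbs to a local maximum
    p_w = []
    for _ in senses:
        raw = {w: rng.random() + 0.5 for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    for _ in range(iters):
        # E-step: responsibility of each sense for each context
        resp = []
        for ctx in contexts:
            scores = []
            for s in senses:
                sc = p_s[s]
                for w in ctx:
                    sc *= p_w[s][w]
                scores.append(sc)
            z = sum(scores) or 1.0
            resp.append([sc / z for sc in scores])
        # M-step: re-estimate p(s) and p(w|s) from the soft counts, with smoothing
        p_s = [sum(r[s] for r in resp) / len(contexts) for s in senses]
        for s in senses:
            counts = defaultdict(float)
            for r, ctx in zip(resp, contexts):
                for w in ctx:
                    counts[w] += r[s]
            z = sum(counts.values()) + alpha * len(vocab)
            p_w[s] = {w: (counts[w] + alpha) / z for w in vocab}
    return p_s, p_w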
Table 1. Summary of all WSD algorithms
Table 1 gives a brief summary of all the Word Sense Disambiguation algorithms discussed above [16]. Computational complexity is one of the major issues that must be considered when choosing a Word Sense Disambiguation algorithm.
3. KNOWLEDGE RESOURCES
There are two categories of knowledge resources: 1) lexical knowledge, which is released for public use, and 2) world knowledge, which is learned from training corpora [16].
3.1 Lexical Knowledge
It is the basis for unsupervised WSD approaches. It has the following components:
i) Sense Frequency is the frequency of occurrence of each sense of a word.
ii) Sense Gloss provides the sense of a word through definitions and examples. A word sense can be tagged by counting the common words between the gloss and the context of the word (a small sketch follows this list).
iii) Concept Trees describe the relationships between synonyms, hypernyms, homonyms, etc. A WSD algorithm can be derived from such a hierarchical concept tree.
iv) Selection Restrictions are semantic restrictions that can be placed on a word sense. LDOCE (Longman Dictionary Of Contemporary English) provides this kind of information.
v) Subject Code refers to the category to which the sense of the target word belongs. Some weighted indicative words, fetched from a training corpus, are also used with the subject code.
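As a concrete example of using sense glosses, a minimal Lesk-style sketch [2] using NLTK's WordNet interface (this assumes nltk and its wordnet corpus are installed, and ignores stop words for simplicity):

from nltk.corpus import wordnet as wn

def gloss_overlap_sense(word, context_words):
    # pick the synset whose gloss (definition + examples) shares
    # the most words with the context of the target word
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for synset in wn.synsets(word, pos=wn.NOUN):
        gloss = synset.definition().lower().split()
        for example in synset.examples():
            gloss += example.lower().split()
        overlap = len(context & set(gloss))
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

print(gloss_overlap_sense("bass", ["caught", "a", "fish", "in", "the", "lake"]))
# typically a fish-related synset, since its gloss shares the most context words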
3.2 Learned World Knowledge
It is very difficult to verbalize world knowledge, so some technique is required that can automatically extract it from contextual knowledge using machine learning techniques. The components of learned world knowledge are as follows:
i) Indicative Words are the words surrounding the target word that help to identify its sense. The closer a word is to the target word, the more indicative of the sense it is (see the sketch after this list).
ii) Syntactic Features refer to sentence structure; they check the position of the specific word, which may be subject, direct object, indirect object, etc. [13].
iii) Domain-Specific Knowledge concerns semantic restrictions that can be applied to each sense of the target word. This knowledge can only be retrieved from a training corpus, and it can be attached to a WSD algorithm for better learning of world knowledge [17].
iv) Parallel Corpora are based on the translation process, which implies that major words such as nouns and verbs share the same sense or concept across languages. These corpora contain two languages, one primary and one secondary, and the major words of the languages are aligned using third-party software [18].
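To make the proximity idea in (i) concrete, a small sketch that weights surrounding words by their closeness to the target word; the 1/distance weighting is our own assumption:

def indicative_weights(tokens, target):
    # weight each context word by 1/distance to the nearest occurrence of the target
    positions = [i for i, t in enumerate(tokens) if t == target]
    weights = {}
    for i, tok in enumerate(tokens):
        if tok == target:
            continue
        d = min(abs(i - p) for p in positions)
        weights[tok] = max(weights.get(tok, 0.0), 1.0 / d)  # closer => more indicative
    return weights

print(indicative_weights("he played bass on his guitar".split(), "bass"))
# {'he': 0.5, 'played': 1.0, 'on': 1.0, 'his': 0.5, 'guitar': 0.33...}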
4. APPLICABILITY OF WSD
Word Sense Disambiguation does not play a direct role in human language technology; instead it contributes to other applications such as Information Retrieval (IR), Machine Translation (MT) and word processing. Another field where WSD plays a major role is the Semantic Web [16], where it participates in ontology learning, building taxonomies, etc. Information Retrieval in particular is an open research area that needs to distinguish the senses of the words searched for by the user and return only the pages that contain the needed senses.
5. STATEMENT OF PROBLEM
To disambiguate a word, two issues must be considered: 1) the context in which the word has been used and 2) some kind of world knowledge. A human being possesses the world knowledge that makes it easy to disambiguate words. For example, if the word “bass” appears in a text, it needs to be disambiguated because of its multiple senses: it may refer to the musical instrument “bass”, or it may refer to the kind of fish “bass”. Since computers do not have the world knowledge used by human beings to disambiguate a word, they need other resources to fulfil this task. Some technique is required that can resolve the ambiguity between such polysemous words.
Precision and recall are two important factors for measuring the performance of WSD. Precision is the proportion of correctly classified instances among those classified; recall is the proportion of correctly classified instances among all instances. In general the recall value is less than the precision value. WSD is applied whenever a semantic understanding of text is needed.
6. OUR APPROACH
There are four parts of speech that allow polysemy: nouns, verbs, adverbs and adjectives. Our approach is a supervised technique used to disambiguate polysemous nouns. To disambiguate the sense of a word we need sense knowledge and contextual knowledge. Sense knowledge comprises lexical knowledge and world knowledge; there is no sharp dividing line between the two, but usually unsupervised approaches use lexical knowledge and supervised approaches use learned world knowledge. Our approach is supervised and uses domain-specific knowledge to resolve the ambiguities between polysemous words. Contextual knowledge contains the word to be sensed and its features.
The proposed algorithm disambiguates the sense of polysemous words when the user performs a search on the Web. The approach is based on domain-specific knowledge, which can be attached to the WSD algorithm by empirical methods. The proposed algorithm has two parts. In the first part we apply pre-processing before sending the query to the search engine. In the second part, or next module, we would apply a mechanism that rearranges the pages retrieved from the search engine according to the user's needs, first by relevance to those needs and then by rank. Since users mostly explore only the top 6-7 pages of a search result, this module would place the relevant pages at the top.
6.1 Algorithm
1. Receive the string entered by user to search
2. Divide the string in tokens
3. for each token
4. search its root word from dictionary
5. check the root word in the list of polysemous words
6. if found
7. retrieve the world knowledge of specific token from dictionary
8. retrieve the contextual information from the domain specified
9. create the sense disambiguation knowledge from world knowledge and contextual
information of token
10. attach the sense of word with string
11. otherwise
12. retain the token as it is
13. if more tokens available
14. go to step 4
15. pass the resultant string to Search Engine
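A hedged Python sketch of the steps above; the root-word dictionary, polysemous-word list and world-knowledge lookup are stubbed with illustrative data, since the paper does not specify their storage format:

POLYSEMOUS = {"bass"}                        # step 5: list of polysemous words
ROOTS = {"basses": "bass", "bass": "bass"}   # step 4: dictionary of root words
WORLD = {"bass": {"fish": "fish", "music": "musical instrument"}}  # step 7 (illustrative)

def preprocess(query, user_domain):
    tokens = query.lower().split()           # steps 1-2: receive the string and tokenize
    out = []
    for tok in tokens:                       # step 3: for each token
        root = ROOTS.get(tok, tok)           # step 4: look up the root word
        if root in POLYSEMOUS and user_domain in WORLD.get(root, {}):
            # steps 7-10: build sense knowledge and attach it to the query string
            out.append(f"{tok} {WORLD[root][user_domain]}")
        else:
            out.append(tok)                  # step 12: retain the token as it is
    return " ".join(out)                     # step 15: pass the result to the engine

print(preprocess("bass habitat", "fish"))    # -> "bass fish habitat"
print(preprocess("bass strings", "music"))   # -> "bass musical instrument strings"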
6.2 System Architecture
Figure 1. System Architecture
This algorithm returns the results as URLs ranked according to the user's domain and their importance.
6.3 Methodology
Two users were considered in this experiment, and each was asked to specify his/her domain of interest. It has been reported that users generally explore only the top 6-7 pages of a search result, so the query results should be relevant to the user's interests. The first user was an ichthyologist, whose domain was the study of fishes; the second was a musician interested in information about various musical instruments.
7. EXPERIMENTAL EVALUATION
The disambiguation algorithm remembers the primary domain of interest and retrieves more meaningful content for the users.
The ichthyologist searched for the word bass via the Google search engine, entering the word bass on the search engine interface as shown in Figure 2.
Figure 2. Results retrieved by Google Search Engine Directly
The results received were not up to the mark, because he/she expected details about the fish “bass”, not about a musical instrument or anything else.
The proposed algorithm resolves the ambiguities between noun homographs. At the time of searching, users never bother about the multiple meanings of a word; their only requirement is that the content relevant to them appear at the top of the results.
When the same user (the ichthyologist) performed the same search through our developed module, the results differed: they were more relevant than the earlier results shown in Figure 2, because the pages appearing at the top of the results provided details about the bass fish.
Figure 3. Results retrieved by new Algorithm-1
If the user is a musician, then it is obvious that he/she is interested in details about the bass as a musical instrument. Figure 4 shows the results when a musician searched for the word bass: here the top of the results provided details about the bass as a musical instrument.
Figure 4. Results retrieved by new Algorithm-2
8. ANALYSIS OF RESULT
Figure 2 shows the result when the user directly enters the keyword bass on the Google interface. Here Google retrieves all possible pages containing the word bass and arranges them in descending order of their PageRank, including pages from all possible domains. With the newly developed algorithm, the user never enters search keywords on the Google interface; instead he/she performs the search via our algorithm's search interface. The algorithm provides the results in a different manner: as can be seen in Figure 3 and Figure 4, both users (the ichthyologist and the musician) enter the same word, the disambiguation algorithm performs some pre-processing and passes the resultant query to the search engine, and as a result the ichthyologist and the musician each receive their respective web pages.
9. CONCLUSION AND FUTURE WORK
As specified earlier, we have developed an algorithm for pre-processing the query that is sent to the search engine in order to retrieve relevant content from the WWW. Future work in this area will revolve around the second part of the research, in which our proposed algorithm would rearrange the retrieved pages so that the user gets the most meaningful content at the top. This rearrangement would be based on a mathematical formula that takes the PageRank value as one of its parameters.
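Since that formula is left to future work, the following is a purely illustrative sketch of one way domain relevance and PageRank could be combined; the relevance measure and the mixing weight beta are assumptions of our own, not the paper's method:

def rerank(pages, domain_terms, beta=0.7):
    # pages: list of (url, pagerank in [0, 1], snippet)
    # score mixes an assumed domain-relevance term with PageRank via beta
    def relevance(snippet):
        words = snippet.lower().split()
        return sum(words.count(t) for t in domain_terms) / (len(words) or 1)
    return sorted(pages, key=lambda p: beta * relevance(p[2]) + (1 - beta) * p[1],
                  reverse=True)

pages = [("a.com", 0.9, "bass guitar amps and strings"),
         ("b.org", 0.4, "bass fish habitat in freshwater lakes")]
print([u for u, _, _ in rerank(pages, {"fish", "lake", "freshwater"})])  # ['b.org', 'a.com']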
REFERENCES
[1] Veronis, J.,Sense Tagging: Don't Look for the Meaning But for the Use, Workshop on Computational
Lexicography and Multimedia Dictionaries, Patras, Greece, pp. 1-9 (2000)
[2] Lesk, M., Automatic Sense Disambiguation: How to Tell a Pine Cone from an Ice Cream Cone. Proceedings of the SIGDOC'86 Conference, ACM (1986)
[3] Galley, M., & McKeown, K., Improving Word Sense Disambiguation in Lexical Chaining,
International Joint Conferences on Artificial Intelligence (2003)
[4] Agirre, E. et al., Combining supervised and unsupervised lexical knowledge methods for word sense disambiguation. Computers and the Humanities, Vol. 34, pp. 103-108 (2000)
[5] Mihalcea, R. & Moldovan, D., An Iterative Approach to Word Sense Disambiguation. Proceedings of
Flairs, Orlando, FL, pp. 219-223 (2000)
[6] Kwong, O.Y., Word Sense Selection in Texts: An Integrated Model, Doctoral Dissertation, University of Cambridge (2000)
[7] Yarowsky, D., Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Meeting of
the Association for Computational Linguistics, pp. 189-196 (1995)
[8] Yarowsky, D., Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained
on Large Corpora. Proceedings of COLING-92, Nantes, France, July 1992, pp. 454-460 (1992)
[9] Chodorow, M., Leacock, C., and Miller, G., A Topical/Local Classifier for Word Sense Identification. Computers and the Humanities, Vol. 34, pp. 115-120 (2000)
[10] Bruce, R. & Wiebe, J., Decomposable modeling in natural language processing. Computational
Linguistics, Vol. 25, No 2 (1999)
[11] O'Hara, T., Wiebe, J., & Bruce, R., Selecting Decomposable Models for Word Sense Disambiguation: The Grling-Sdm System. Computers and the Humanities, Vol. 34, pp. 159-164 (2000)
[12] Daelemans, W. et al., TiMBL: Tilburg Memory Based Learner V2.0 Reference Guide, ILK Technical Report ILK 99-01 (1999)
[13] Fellbaum, C. & Palmer, M., Manual and Automatic Semantic Annotation with WordNet. Proceedings
of NAACL Workshop (2001)
[14] Berger, A. et al., A maximum entropy approach to natural language processing. Computational
Linguistics, Vol. 22, No 1 (1996)
[15] Dempster A. et al., Maximum Likelihood from Incomplete Data via the EM Algorithm. J Royal
Statist Soc Series B Vol. 39, pp. 1-38 (1977)
[16] Xiaohua Zhou, Hyoil Han, Survey of Word Sense Disambiguation Approaches. 18th FLAIRS
Conference, Clearwater Beach, Florida (2005)
[17] Hastings, P. et al., Inferring the meaning of verbs from context. Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (CogSci-98), Madison, Wisconsin (1998)
[18] Bhattacharya, I., Getoor, L., and Bengio, Y., Unsupervised sense disambiguation using bilingual
probabilistic models. Proceedings of the Annual Meeting of ACL (2004)
Authors
Rekha Jain completed her Master's degree in Computer Science at Kurukshetra University in 2004. She is now working as an Assistant Professor in the Apaji Institute of Mathematics & Applied Computer Technology at Banasthali University, Rajasthan, and is pursuing a Ph.D. under the supervision of Prof. G. N. Purohit. Her current research interests include Web Mining, the Semantic Web and Data Mining. She has various national and international publications and conference papers.
Prof. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief Editor of a research journal and a regular reviewer for many journals. His present interests are in O.R., Discrete Mathematics and Communication Networks. He has published around 40 research papers in various journals.