Explicit Semantic Analysis (ESA) is an approach to measuring the semantic relatedness between terms or documents based on their similarity to documents of a reference corpus, usually Wikipedia. ESA has received considerable attention in natural language processing (NLP) and information retrieval. However, ESA relies on a huge Wikipedia index matrix at its interpretation stage, multiplying a large matrix by a term vector to produce a high-dimensional concept vector. Consequently, both the interpretation and the similarity steps of ESA are expensive, and much time is lost in unnecessary operations. This paper proposes an enhancement of ESA, called optimize-ESA, which reduces the dimensionality at the interpretation stage by computing semantic similarity within a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full ESA approach.
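To make the interpretation cost concrete, here is a minimal sketch of the ESA pipeline, assuming a precomputed term-by-concept TF-IDF index; the function and variable names are ours, and the domain-restricted variant at the end only illustrates the optimize-ESA idea of shrinking the matrix-vector product.

```python
import numpy as np

def interpret(term_vector, index_matrix):
    """ESA interpretation: project a weighted term vector onto Wikipedia
    concepts via the (n_terms x n_concepts) TF-IDF index matrix."""
    return term_vector @ index_matrix            # (n_concepts,) concept vector

def relatedness(u, v):
    """Semantic relatedness as cosine similarity of concept vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def interpret_domain(term_vector, index_matrix, domain_cols):
    """Hypothetical optimize-ESA step: keep only the concept columns of a
    chosen domain, so the matrix-vector product is much smaller."""
    return term_vector @ index_matrix[:, domain_cols]
```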
Improving Text Categorization with Semantic Knowledge in Wikipedia (chjshan)
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and high-dimensional. In traditional text classification methods, documents are represented with the "Bag of Words" (BOW) schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation in place of the traditional BOW model for text classification. To overcome the weakness of ignoring the semantic relationships among terms in the document representation and to exploit the rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich the Wikipedia-concept-based representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
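As a rough illustration of the enrichment idea (our own toy construction, not the paper's exact semantic matrix): multiplying a document-by-concept matrix by a concept-by-concept similarity matrix spreads weight onto related concepts, so documents about related topics stop being orthogonal.

```python
import numpy as np

# Toy document-by-concept matrix (2 docs, 3 Wikipedia concepts).
D = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Concept-by-concept semantic matrix: concepts 0 and 1 are related.
S = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Enriched representation: each document now carries weight on concepts
# related to the ones it mentions, so the two documents overlap.
D_enriched = D @ S
print(D_enriched)   # [[1.  0.8 0. ] [0.8 1.  0. ]]
```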
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
This document provides an overview and introduction to the statistical software R. It describes how R can be obtained and installed. R is a free and open-source software environment for statistical analysis and graphics. The document outlines the basic features of the R environment, including how to work with data and packages in R. It provides a conceptual overview of the organization of the book, which uses R and biological examples to teach statistics concepts ranging from basic to advanced topics.
This document describes a proposed concept-based mining model that aims to improve document clustering and information retrieval by extracting concepts and semantic relationships rather than just keywords. The model uses natural language processing techniques like part-of-speech tagging and parsing to extract concepts from text. It represents concepts and their relationships in a semantic network and clusters documents based on conceptual similarity rather than term frequency. The model is evaluated using singular value decomposition to increase the precision of key term and phrase extraction.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document presents a novel approach for clustering textual information in emails using text data mining techniques. It discusses using k-means clustering and a vector space model to group similar emails based on word patterns and frequencies. The methodology involves preprocessing emails, applying a Porter stemmer, calculating term frequencies, and using k-means to form clusters. Clusters will contain emails with similar content, allowing users to more easily process emails based on priority. This clustering approach could reduce the time users spend filtering through emails one by one.
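A compact sketch of that pipeline, assuming scikit-learn and NLTK; we use TF-IDF weights rather than raw term frequencies for convenience, and the sample emails are invented.

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = ["Meeting moved to Friday at 10am",
          "Project deadline reminder for Friday",
          "Lunch menu for the office party",
          "Party catering order confirmed"]

stemmer = PorterStemmer()

def stem_tokens(text):
    # Preprocess: lowercase and apply the Porter stemmer to each token.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

# Vectorize the stemmed emails, then group them with k-means.
vectors = TfidfVectorizer(stop_words="english").fit_transform(
    stem_tokens(e) for e in emails)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # emails with similar content share a cluster label
```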
This document summarizes a survey on string similarity matching search techniques. It discusses how string similarity matching is used to find relevant information in text collections. The document reviews different algorithms for string matching, including edit distance, NR-grep, n-grams, and approaches based on hashing and locality-sensitive hashing. It analyzes techniques like pattern matching, threshold-based joins, and vector representations. The goal is to present an overview of the field and compare algorithm performance for similarity searches.
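Of the surveyed techniques, edit distance is the easiest to show concretely; a standard dynamic-programming implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
```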
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
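As we read the measure, the relatedness of two documents in a cluster is averaged over outside documents taken as viewpoints; a rough NumPy reconstruction (ours, not the authors' code):

```python
import numpy as np

def mvs(di, dj, viewpoints):
    """Multi-viewpoint similarity: average, over viewpoints dh assumed
    NOT to be in the same cluster as di and dj, of the inner product of
    the difference vectors (di - dh) and (dj - dh)."""
    total = sum((di - dh) @ (dj - dh) for dh in viewpoints)
    return total / len(viewpoints)

di = np.array([1.0, 0.0])
dj = np.array([0.9, 0.1])
outside = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
print(mvs(di, dj, outside))   # relatedness seen from the outside documents
```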
The document proposes a mixed approach using existing natural language processing techniques and novel techniques to automatically construct conceptual taxonomies from text. Key steps include identifying relevant concepts and attributes from text, clustering similar concepts, computing relevance weights for concepts, and generalizing concepts using WordNet. Preliminary results suggest the approach shows promise for extending and improving automatic taxonomy construction.
This document summarizes an article that proposes an automatic text summarization technique using feature terms to calculate sentence relevance. The technique uses both statistical and linguistic methods to identify semantically important sentences for creating a generic summary. It determines the relevance of sentences based on feature term ranks and performs semantic analysis of sentences with the highest ranks to select those most important for the summary. The performance is evaluated by comparing summaries to those created by human evaluators.
The document presents a new ontology matching system based on a multi-agent architecture. The system takes ontologies described in XML, RDF Schema, and OWL as input. It uses multiple matchers and filtering to generate mappings between ontology entities. The mappings are then validated. The system is implemented as a multi-agent system with different agent types responsible for resources, matching, generating mappings, and filtering/validating mappings. The architecture allows for robust, flexible, and scalable ontology matching.
A rough set based hybrid method to text categorization (Ninad Samel)
This document summarizes a hybrid text categorization method that combines Latent Semantic Indexing (LSI) and Rough Sets theory to reduce the dimensionality of text data and generate classification rules. It introduces LSI to reduce the feature space of text documents represented as high-dimensional vectors. Then it applies Rough Sets theory to the reduced feature space to locate a minimal set of keywords that can distinguish document classes and generate multiple knowledge bases for classification instead of a single one. The method is tested on text categorization tasks and shown to improve accuracy over previous Rough Sets approaches.
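The LSI half of the method can be approximated with a truncated SVD; a brief sketch of that dimensionality-reduction step (the rough-set rule generation is not shown):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["rough sets handle vagueness in data",
        "latent semantic indexing reduces dimensions",
        "text categorization assigns class labels",
        "semantic indexing helps categorization"]

X = TfidfVectorizer().fit_transform(docs)   # high-dimensional term space
lsi = TruncatedSVD(n_components=2, random_state=0)
X_reduced = lsi.fit_transform(X)            # documents in a 2-d latent space
print(X_reduced.shape)                      # (4, 2)
```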
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
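A minimal LDA fit in the spirit of the model described, using scikit-learn's implementation for brevity:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors watched the markets"]

# LDA works on word counts, not TF-IDF.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures
print(doc_topics.round(2))               # each row sums to 1 over the 2 topics
```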
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi... (iosrjce)
1) The document discusses an approach to measure semantic similarity between multiple documents using an enhanced suffix tree. It involves preprocessing documents, constructing a suffix tree with documents' phrases as edges, calculating weights of shared nodes using TF-IDF, and applying cosine, dice, and hellinger similarity measures to determine pairwise document similarities.
2) The approach first preprocesses documents by removing stop words, special characters, and converting to lowercase. A suffix tree is constructed with documents' phrases as edges. Shared nodes in the tree represent common phrases between documents.
3) Node weights are calculated using TF-IDF, with higher weights given to rarer phrases. Several similarity measures (cosine, dice, Hellinger) are then applied; minimal versions of these measures are sketched below.
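For reference, minimal NumPy versions of the three measures over weighted node vectors; the Hellinger variant is expressed as a similarity derived from the usual distance, and the exact normalizations in the paper may differ.

```python
import numpy as np

def cosine(p, q):
    return (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12)

def dice(p, q):
    return 2 * (p @ q) / (p @ p + q @ q + 1e-12)

def hellinger_similarity(p, q):
    """1 minus the Hellinger distance of the L1-normalized vectors."""
    p, q = p / p.sum(), q / q.sum()
    dist = np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)
    return 1.0 - dist

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(cosine(p, q), dice(p, q), hellinger_similarity(p, q))
```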
An efficient approach for semantically enhanced document clustering by using ... (ijaia)
Traditional techniques of document clustering do not consider the semantic relationships between words when assigning documents to clusters. For instance, if two documents discuss the same topic using different words (which may be synonyms or semantically associated), these techniques may assign the documents to different clusters. Previous research has approached this problem by enriching the document representation with background knowledge from an ontology. This paper presents a new approach to enhance document clustering by exploiting the semantic knowledge contained in Wikipedia. We first map terms within documents to their corresponding Wikipedia concepts. Then, the similarity between each pair of terms is calculated using Wikipedia's link structure. The document's vector representation is then adjusted so that terms that are semantically related gain more weight. Our approach differs from related efforts in two aspects: first, unlike others who built their own methods of measuring similarity through the Wikipedia categories, our approach uses a similarity measure modelled after the Normalized Google Distance, a well-known and low-cost method of measuring term similarity. Second, it is more time efficient, as it applies an algorithm for phrase extraction from documents prior to matching terms with Wikipedia. Our approach was evaluated by comparison with different state-of-the-art methods on two different datasets. Empirical results showed that our approach improved the clustering results compared to other approaches.
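Since the measure is modelled after the Normalized Google Distance, the standard NGD formula is worth stating as a small function; here f(x) counts pages (or, in the paper's setting, Wikipedia links/articles) containing x, and N is the total count.

```python
from math import log

def ngd(fx, fy, fxy, N):
    """Normalized Google Distance from occurrence counts:
    NGD(x, y) = (max(log fx, log fy) - log fxy)
                / (log N - min(log fx, log fy))."""
    num = max(log(fx), log(fy)) - log(fxy)
    den = log(N) - min(log(fx), log(fy))
    return num / den

# Toy counts: two terms that co-occur often are close (small distance).
print(ngd(fx=1000, fy=800, fxy=700, N=1_000_000))
```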
This document presents an algorithm for a semantic-based similarity measure (SBSM) to improve text clustering. The algorithm assigns semantic weights to document terms and phrases based on their use as arguments in Proposition Bank notation. It calculates the similarity between a document and a query by matching weighted terms and phrases. Experimental results on a dataset show that SBSM using Proposition Bank notation improves performance over traditional measures like cosine and Jaccard similarity. The algorithm captures semantic information within documents for more accurate similarity assessment and clustering.
An Improved Similarity Matching based Clustering Framework for Short and Sent... (IJECEIAES)
Text clustering plays a key role in navigation and browsing. For efficient text clustering, a large amount of information must be grouped into meaningful clusters. Many text clustering techniques do not address issues such as high time and space complexity, inability to understand the relational and contextual attributes of words, limited robustness, and risks related to privacy exposure. To address these issues, an efficient text-based clustering framework is proposed. The Reuters dataset is chosen as the input. Once the input dataset is preprocessed, the similarity between words is computed using cosine similarity. The similarities between the components are compared and the vector data is created. From the vector data the clustering particle is computed. To optimize the clustering results, mutation is applied to the vector data. The performance of the proposed framework is analyzed using metrics such as Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR) and processing time. Experimental results show that the proposed framework produced better MSE, PSNR and processing time than the existing Fuzzy C-Means (FCM) and Pairwise Random Swap (PRS) methods.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces several approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, several categories very similar in content and related to the telecommunications, Internet and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
Review of Various Text Categorization Methods (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publication.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task for effective document retrieval. Automatic text classification is the process of assigning a new text document to predefined categories based on its content. In this paper, we implemented and compared Naïve Bayes and centroid-based algorithms for effective categorization of English-language text documents. In the centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. Experiments were performed on the R-52 dataset of the Reuters-21578 corpus, with the Micro Average F1 measure used to evaluate classifier performance. Experimental results show that the Micro Average F1 value for NB is the greatest, followed by CGC, which in turn is greater than AAC. All these results are valuable for future research.
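A sketch of the centroid-based side with the Arithmetical Average Centroid (class centroid = mean of the class's document vectors, prediction by highest cosine similarity); the normalization details are our choices.

```python
import numpy as np

def train_aac(X, y):
    """Arithmetical Average Centroid: one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(x, centroids):
    """Assign the class whose centroid has the highest cosine similarity."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return max(centroids, key=lambda c: cos(x, centroids[c]))

X = np.array([[1.0, 0.1], [0.9, 0.2],    # class 0 documents
              [0.1, 1.0], [0.2, 0.8]])   # class 1 documents
y = np.array([0, 0, 1, 1])
centroids = train_aac(X, y)
print(predict(np.array([0.95, 0.15]), centroids))   # -> 0
```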
Query Answering Approach Based on Document Summarization (IJMER)
The growth of online information has prompted thorough research in the domain of automatic text summarization within the Natural Language Processing (NLP) community. The aim of this paper is to propose a novel, language-independent automatic summarization approach that combines three main approaches: Rhetorical Structure Theory (RST), the query processing approach, and the Network Representation Approach (NRA). RST, a theory of the major aspects of the structure of natural text, is used to extract the semantic relations behind the text. The query processing approach classifies the question type and finds the answer in a way that suits the user's needs. The NRA is used to create a graph representing the extracted semantic relations. The output is an answer which not only responds to the question, but also gives the user an opportunity to find additional information related to the question. We implemented the proposed approach; as a case study, it was applied to Arabic text in the agriculture field. The implemented approach succeeded in summarizing extension documents according to the user's query. The results have been evaluated using Recall, Precision and F-score measures.
In this paper we correlate text sequences that share common topics to obtain semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for common topics in the sequences and isolates them together with their timestamps. Step two takes a topic and tries to assign the timestamp of the text document. After multiple repetitions of step two, we obtain an optimal result.
Text Segmentation for Online Subjective Examination using Machine Learning (IRJET Journal)
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine similarity between texts. The goal is to implement natural language processing using text segmentation. It reviews related work applying machine learning methods such as K-NN, support vector machines and decision trees to tasks like text categorization and clustering.
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Nevertheless, the features that represent those documents are also too large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering intensive data documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our presented experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
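The core bisecting k-means loop is easy to sketch in a single process; the distributed MapReduce layer and the WordNet enrichment described above are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    """Repeatedly split the largest cluster with 2-means until k clusters."""
    clusters = [np.arange(len(X))]   # start with one cluster holding all rows
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

X = np.random.RandomState(0).rand(20, 5)   # toy document vectors
for i, idx in enumerate(bisecting_kmeans(X, k=4)):
    print(f"cluster {i}: {len(idx)} documents")
```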
AN EFFICIENT APPROACH FOR SEMANTICALLY ENHANCED DOCUMENT CLUSTERING BY USING W... (ijaia)
This document presents a new approach to improve document clustering by exploiting the semantic relationships between terms contained in Wikipedia. The approach first maps terms within documents to corresponding Wikipedia concepts. It then calculates the semantic similarity between terms using Wikipedia's link structure. The document vectors are adjusted so that semantically related terms gain more weight. The approach differs from previous work by using a well-known measure of semantic similarity based on Normalized Google Distance, and by applying phrase extraction to more efficiently map terms to Wikipedia concepts. An evaluation on two datasets found the approach improved clustering results over other state-of-the-art methods.
Abstract: Traditional approaches to document classification need labelled data to construct reliable and accurate classifiers. Unfortunately, labelled data is rarely available and often too costly to obtain. For a given learning task for which training data is unavailable, abundant labelled data may exist for a different but related domain. One would like to use that related labelled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has previously been proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows propagating labels between the two domains captures not only common words, but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data. Keywords: Classification, Clustering, Cross-domain Text Classification, Co-clustering, Labelled data, Traditional Approaches.
Title: Co-Clustering For Cross-Domain Text Classification
Author: Rayala Venkat, Mahanthi Kasaragadda
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
A comparative analysis of particle swarm optimization and k means algorithm f... (ijnlc)
The volume of digitized text documents on the web has been increasing rapidly. With such a huge collection of data on the web, there is a need for grouping (clustering) documents into clusters for speedy information retrieval. Document clustering is the grouping of documents such that the documents within each group are similar to each other and dissimilar to documents of other groups. The quality of the clustering result depends greatly on the representation of the text and on the clustering algorithm. This paper presents a comparative analysis of three algorithms, namely K-means, Particle Swarm Optimization (PSO) and a hybrid PSO+K-means algorithm, for clustering text documents using WordNet. The common way of representing a text document is as a bag of terms, which is often unsatisfactory as it does not exploit semantics. In this paper, texts are represented in terms of the synsets corresponding to each word, so the bag-of-terms representation is enriched with synonyms from WordNet. The K-means, PSO and hybrid PSO+K-means algorithms are applied to clustering of text in the Nepali language. Experimental evaluation is performed using intra-cluster similarity and inter-cluster similarity.
This document evaluates the reliability of using Wikipedia as a source for cross-lingual query translation between English and Portuguese in the medical domain. Experiments were conducted translating single-word, two-word, and three-word queries between the two languages using Wikipedia links. Results showed Wikipedia coverage of single-word queries was around 81% for English and 80% for Portuguese. For query translation, coverage was 60% from English to Portuguese and 88% from Portuguese to English for single-word queries. Coverage decreased as query length increased. The study demonstrates Wikipedia has potential for cross-lingual information retrieval but coverage varies with query complexity.
Biomedical indexing and retrieval system based on language modeling approach (ijseajournal)
This summarizes a research paper that proposes a biomedical indexing and retrieval system called BIOINSY. It uses a language modeling approach to select the best Medical Subject Headings (MeSH) descriptors to index medical articles from sources like PUBMED. The system first preprocesses articles by splitting text, stemming words, and removing stop words. It then extracts terms using a hybrid linguistic and statistical approach. Terms are weighted based on semantic relationships in MeSH, not just statistics. Descriptors are selected by disambiguating terms and estimating the probability a descriptor was generated by the article's language model. Experiments showed the effectiveness of this conceptual indexing approach.
Cooperating Techniques for Extracting Conceptual Taxonomies from Text (Fulvio Rotella)
The document proposes a mixed approach using existing natural language processing techniques and novel techniques to automatically construct conceptual taxonomies from text. It identifies relevant concepts from text using keyword extraction, clustering, and computing relevance weights. It then generalizes similar concepts using WordNet to group concepts and disambiguate word senses. Preliminary evaluations show promising initial results.
Semantic Based Document Clustering Using Lexical Chains (IRJET Journal)
The document proposes a semantic-based document clustering approach using lexical chains. It uses WordNet to perform word sense disambiguation on documents to identify core semantic features. Lexical chains of semantically related words are then generated from the documents based on these features. The lexical chains represent the semantic content of documents and are used to cluster the documents. The results show improved clustering performance compared to traditional approaches. The approach aims to address challenges in text clustering like extracting core semantics, assigning meaningful cluster descriptions, and vocabulary diversity.
IRJET-Semantic Based Document Clustering Using Lexical Chains (IRJET Journal)
This document discusses a semantic-based document clustering approach using lexical chains. It proposes using WordNet to perform word sense disambiguation on documents to extract core semantic features represented as lexical chains. Lexical chains identify semantically related words in a text based on relations like synonyms and hypernyms. Documents are then clustered based on the lexical chains extracted. The approach aims to overcome issues in traditional clustering like synonyms and polysemy by incorporating semantic information from WordNet ontology. It is argued that identifying themes based on disambiguated semantic features extracted via lexical chains can improve text clustering performance compared to bag-of-words models. An evaluation of the approach showed better results when using a threshold of 50% for lexical chain selection.
IRJET- Short-Text Semantic Similarity using Glove Word Embedding (IRJET Journal)
The document describes a study that uses GloVe word embeddings to measure semantic similarity between short texts. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The study trains GloVe word embeddings on a large corpus, then uses the embeddings to encode short texts and calculate their semantic similarity, comparing the accuracy to methods that use Word2Vec embeddings. It aims to show that GloVe embeddings may provide better performance for short text semantic similarity tasks.
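A sketch of the encoding step the study relies on: average the GloVe vectors of a snippet's words and compare snippets by cosine similarity. The file path and dimensionality below are placeholders for whichever pre-trained GloVe release is used.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: one word per line followed by its vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def encode(text, vectors, dim=100):
    # Average the vectors of in-vocabulary words; zero vector if none.
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0) if words else np.zeros(dim)

def similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

glove = load_glove("glove.6B.100d.txt")   # placeholder path
s1 = encode("the doctor examined the patient", glove)
s2 = encode("a physician checked the sick person", glove)
print(similarity(s1, s2))
```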
Semantic Relatedness of Web Resources by XESA - Philipp Scholl (CROKODIl consortium)
This document discusses using extended explicit semantic analysis (XESA) to measure semantic relatedness between short text snippets for recommendation purposes. It proposes enhancing ESA by incorporating additional semantic information from Wikipedia, such as article links and categories. An evaluation compares the performance of ESA, XESA using the article graph, XESA using categories, and a combination. The results show that XESA using the article graph improves over ESA by up to 9% and performs best for recommending related snippets.
Classification of News and Research Articles Using Text Pattern Mining (IOSR Journals)
This document summarizes a research paper that proposes a method for classifying news and research articles using text pattern mining. The method involves preprocessing text to remove stop words and perform stemming. Frequent and closed patterns are then discovered from the preprocessed text. These patterns are structured into a taxonomy and deployed to classify new documents. The method also involves evolving patterns by reshuffling term supports within patterns to reduce the effects of noise from negative documents. Over 80% of documents were successfully classified using this pattern-based approach.
Survey On Building A Database Driven Reverse Dictionary (Editor IJMTER)
Reverse dictionaries are widely used as reference works organized by concepts, phrases, or the definitions of words. This paper describes the many challenges inherent in building a reverse lexicon and maps the problem to the well-known abstract similarity problem. Current web search engines are basic versions of such a system; they take advantage of huge scale, which permits inferring general interest in documents from link information. This paper describes a basic study of a database-driven reverse dictionary using three large-scale datasets, namely person names, general English words and biomedical concepts, and analyzes the difficulties arising in the use of documents produced by a reverse dictionary.
Relevance feature discovery for text mining (redpel dot com)
The document discusses relevance feature discovery for text mining. It presents an innovative model that discovers both positive and negative patterns in text documents as higher-level features and uses them to classify terms into categories and update term weights based on their specificity and distribution in patterns. Experiments on standard datasets show the proposed model outperforms both term-based and pattern-based methods.
Enhanced TextRank using weighted word embedding for text summarization (IJECEIAES)
The length of a news article may influence people's interest in reading it. Text summarization can help by creating a shorter representative version of an article to reduce reading time. This paper proposes to use weighted word embedding based on Word2Vec, FastText, and bidirectional encoder representations from transformers (BERT) models to enhance the TextRank summarization algorithm. Weighted word embedding is aimed at creating better sentence representations in order to produce more accurate summaries. The results show that using (unweighted) word embedding significantly improves the performance of the TextRank algorithm, with the best performance gained by the summarization system using BERT word embedding. When each word embedding is weighted using term frequency-inverse document frequency (TF-IDF), the performance of all systems improves further and significantly, with the biggest improvements achieved by the systems using Word2Vec (a 6.80% to 12.92% increase) and FastText (a 7.04% to 12.78% increase). Overall, our systems using weighted word embedding can outperform the TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2, demonstrating the effectiveness of weighted word embedding in the TextRank algorithm for text summarization.
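The weighting idea amounts to a TF-IDF-weighted average of word vectors per sentence before TextRank ranks them; a sketch of that sentence-encoding step, with random vectors standing in for Word2Vec/FastText/BERT embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["markets rallied after the announcement",
             "the announcement lifted investor confidence"]

tfidf = TfidfVectorizer()
W = tfidf.fit_transform(sentences)             # sentence-by-term TF-IDF weights
vocab = tfidf.get_feature_names_out()

rng = np.random.RandomState(0)
word_vecs = {w: rng.rand(50) for w in vocab}   # stand-in word embeddings

def weighted_sentence_vector(i):
    """TF-IDF-weighted average of the word vectors of sentence i."""
    row = W[i].toarray().ravel()
    weights = row[row > 0]
    vecs = np.array([word_vecs[vocab[j]] for j in np.flatnonzero(row)])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

print(weighted_sentence_vector(0).shape)       # (50,)
```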
An Incremental Method For Meaning Elicitation Of A Domain Ontology (Audrey Britton)
This document describes MELIS (Meaning Elicitation and Lexical Integration System), a tool developed to support the annotation of data sources with conceptual and lexical information during the ontology generation process in MOMIS (Mediator envirOnment for Multiple Information Systems). MELIS improves upon MOMIS by enabling more automated annotation of data sources and by extracting relationships between lexical elements using lexical and domain knowledge to provide a richer semantics. The document outlines the MOMIS architecture integrated with MELIS and explains how MELIS supports key steps in the ontology creation process, including source annotation, common thesaurus generation, and relationship extraction.
Effect of word embedding vector dimensionality on sentiment analysis through ... (IAESIJAI)
Word embedding has become the most popular method of lexical description in a given context in the natural language processing domain, especially through the word to vector (Word2Vec) and global vectors (GloVe) implementations. Since GloVe is a pre-trained model that provides access to word mapping vectors in many dimensionalities, a large number of applications rely on its prowess, especially in the field of sentiment analysis. However, in the literature, we found that in many cases GloVe is implemented with arbitrary dimensionalities (often 300d) regardless of the length of the text to be analyzed. In this work, we conducted a study that identifies the effect of the dimensionality of word embedding mapping vectors on short and long texts in a sentiment analysis context. The results suggest that as the dimensionality of the vectors increases, the performance metrics of the model also increase for long texts. In contrast, for short texts, we recorded a threshold beyond which dimensionality does not matter.
Ontology Based Approach for Classifying Biomedical Text Abstracts (Waqas Tariq)
Classifying biomedical literature is a difficult and challenging task, especially when a large number of biomedical articles should be organized into a hierarchical structure. Due to this problem, various classification methods were proposed by many researchers for classifying biomedical literature in order to help users find relevant articles on the web. In this paper, we propose a new approach to classifying a collection of biomedical text abstracts by using an ontology alignment algorithm that we have developed. To accomplish our goal, we construct the OHSUMED disease hierarchy as the initial training hierarchy and the Medline abstract disease hierarchies as our testing hierarchy. For enriching our training hierarchy, we use the relevant features extracted from selected categories in the OHSUMED dataset as feature vectors. These feature vectors are then mapped to each node or concept in the OHSUMED disease hierarchy according to their specific category. Afterward, we align and match the concepts in both hierarchies using our ontology alignment algorithm to find probable concepts or categories. Subsequently, we compute the cosine similarity score between the feature vectors of probable concepts in the "enriched" OHSUMED disease hierarchy and the Medline abstract disease hierarchy. Finally, we assign a category to new Medline abstracts based on the highest cosine similarity score. The results obtained from the experiments demonstrate that our proposed approach for hierarchical classification performs slightly better than multi-class flat classification.
An in-depth review on News Classification through NLP (IRJET Journal)
This document provides an in-depth literature review of news classification through natural language processing (NLP). It discusses several existing approaches to news classification, including models that use convolutional neural networks (CNNs), graph-based approaches, and attention mechanisms. The document also notes that current search engines often return too many irrelevant results, so classification could help layer search results. It concludes that while many techniques have been developed, inconsistencies remain in effectively classifying news, so further research on combining NLP, feature extraction, and fuzzy logic is needed.
Similar to Towards optimize-ESA for text semantic similarity: A case study of biomedical text
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including an impressive global accuracy of 99.286%, a high class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of our proposed model. These findings underscore the model's competence in precise brain tumor localization, underscoring its potential to revolutionize medical image analysis and enhance healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, emphasizing addressing false positives and resource efficiency.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses in human lives, properties, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. The system is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions. The system effectively detects potential risks and helps mitigate the frequency and impact of accidents. The primary goal is to ensure the safety of drivers and vehicles. Collecting data involved gathering information on three key road events: normal street and normal drive, speed bumps, circular yellow speed bumps, and three aggressive driving actions: sudden start, sudden stop, and sudden entry. The gathered data is processed and analyzed using a machine learning system designed for limited power and memory devices. The developed system resulted in 91.9% accuracy, 93.6% precision, and 92% recall. The achieved inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms and requires 2.6 kB peak RAM and 139.9 kB program flash memory, making it suitable for resource-constrained embedded systems.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Neural network optimizer of proportional-integral-differential controller par... (IJECEIAES)
Wide application of proportional-integral-differential (PID)-regulator in industry requires constant improvement of methods of its parameters adjustment. The paper deals with the issues of optimization of PID-regulator parameters with the use of neural network technology methods. A methodology for choosing the architecture (structure) of neural network optimizer is proposed, which consists in determining the number of layers, the number of neurons in each layer, as well as the form and type of activation function. Algorithms of neural network training based on the application of the method of minimizing the mismatch between the regulated value and the target value are developed. The method of back propagation of gradients is proposed to select the optimal training rate of neurons of the neural network. The neural network optimizer, which is a superstructure of the linear PID controller, allows increasing the regulation accuracy from 0.23 to 0.09, thus reducing the power consumption from 65% to 53%. The results of the conducted experiments allow us to conclude that the created neural superstructure may well become a prototype of an automatic voltage regulator (AVR)-type industrial controller for tuning the parameters of the PID controller.
An improved modulation technique suitable for a three level flying capacitor ... (IJECEIAES)
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed simplified modulation technique paves the way for more straightforward and efficient control of multilevel inverters, enabling their widespread adoption and integration into modern power electronic systems. Through the amalgamation of sinusoidal pulse width modulation (SPWM) with a high-frequency square wave pulse, this controlling technique attains energy equilibrium across the coupling capacitor. The modulation scheme incorporates a simplified switching pattern and a decreased count of voltage references, thereby simplifying the control algorithm.
A review on features and methods of potential fishing zone (IJECEIAES)
This review focuses on the importance of identifying potential fishing zones in seawater for sustainable fishing practices. It explores features like sea surface temperature (SST) and sea surface height (SSH), along with the classification methods used to classify the data. This study underscores the importance of examining potential fishing zones using advanced analytical techniques. It thoroughly explores the methodologies employed by researchers, covering both past and current approaches. The examination centers on data characteristics and the application of classification algorithms for classifying potential fishing zones. Furthermore, the prediction of potential fishing zones relies significantly on the effectiveness of classification algorithms. Previous research has assessed the performance of models like support vector machines (SVM), naïve Bayes, and artificial neural networks (ANN); in one reported result, SVM classified test data for fisheries classification with 97.6% accuracy compared to naïve Bayes's 94.2%. Considering the recent works in this area, several recommendations for future work are presented to further improve the performance of potential fishing zone models, which is important to the fisheries community.
Electrical signal interference minimization using appropriate core material f... (IJECEIAES)
As demand for smaller, quicker, and more powerful devices rises, Moore's law is strictly followed. The industry has worked hard to make little devices that boost productivity. The goal is to optimize device density. Scientists are reducing connection delays to improve circuit performance. This helped them understand three-dimensional integrated circuit (3D IC) concepts, which stack active devices and create vertical connections to diminish latency and lower interconnects. Electrical involvement is a big worry with 3D integrated circuits. Researchers have developed and tested through silicon via (TSV) and substrates to decrease electrical wave involvement. This study illustrates a novel noise coupling reduction method using several electrical involvement models. A 22% drop in electrical involvement from wave-carrying to victim TSVs introduces this new paradigm and improves system performance even at higher THz frequencies.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... (IJECEIAES)
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces a hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system, including the required equations to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram, which sets the priorities and requirements of the system, is presented. The proposed approach allows setups to advance their power stability, especially during power outages. The presented information supports researchers and plant owners in completing the necessary analysis while promoting the deployment of clean energy. The results of a case study representing a dairy milk farmer support the theoretical work and highlight its benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty for a sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Bibliometric analysis highlighting the role of women in addressing climate ch... (IJECEIAES)
Fossil fuel consumption increased quickly, contributing to climate change that is evident in unusual flooding and droughts, and global warming. Over the past ten years, women's involvement in society has grown dramatically, and they succeeded in playing a noticeable role in reducing climate change. A bibliometric analysis of data from the last ten years has been carried out to examine the role of women in addressing climate change. The analysis's findings are discussed in relation to the sustainable development goals (SDGs), particularly SDG 7 and SDG 13. The results considered contributions made by women in various sectors while taking geographic dispersion into account. The bibliometric analysis delves into topics including women's leadership in environmental groups, their involvement in policymaking, their contributions to sustainable development projects, and the influence of gender diversity on attempts to mitigate climate change. This study's results highlight how women have influenced policies and actions related to climate change, point out areas of research deficiency, and offer recommendations on how to increase the role of women in addressing climate change and achieving sustainability. To achieve more successful results, this initiative aims to highlight the significance of gender equality and encourage inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter... (IJECEIAES)
The active and reactive load changes have a significant impact on voltage and frequency. In this paper, in order to stabilize the microgrid (MG) against load variations in islanding mode, the active and reactive power of all distributed generators (DGs), including energy storage (battery), diesel generator, and micro-turbine, are controlled. The micro-turbine generator is connected to MG through a three-phase to three-phase matrix converter, and the droop control method is applied for controlling the voltage and frequency of MG. In addition, a method is introduced for voltage and frequency control of micro-turbines in the transition state from grid-connected mode to islanding mode. A novel switching strategy of the matrix converter is used for converting the high-frequency output voltage of the micro-turbine to the grid-side frequency of the utility system. Moreover, using the switching strategy, the low-order harmonics in the output current and voltage are not produced, and consequently, the size of the output filter would be reduced. In fact, the suggested control strategy is load-independent and has no frequency conversion restrictions. The proposed approach for voltage and frequency regulation demonstrates exceptional performance and favorable response across various load alteration scenarios. The suggested strategy is examined in several scenarios in the MG test systems, and the simulation results are addressed.
Enhancing battery system identification: nonlinear autoregressive modeling fo... (IJECEIAES)
Precisely characterizing Li-ion batteries is essential for optimizing their performance, enhancing safety, and prolonging their lifespan across various applications, such as electric vehicles and renewable energy systems. This article introduces an innovative nonlinear methodology for system identification of a Li-ion battery, employing a nonlinear autoregressive with exogenous inputs (NARX) model. The proposed approach integrates the benefits of nonlinear modeling with the adaptability of the NARX structure, facilitating a more comprehensive representation of the intricate electrochemical processes within the battery. Experimental data collected from a Li-ion battery operating under diverse scenarios are employed to validate the effectiveness of the proposed methodology. The identified NARX model exhibits superior accuracy in predicting the battery's behavior compared to traditional linear models. This study underscores the importance of accounting for nonlinearities in battery modeling, providing insights into the intricate relationships between state-of-charge, voltage, and current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a survey (IJECEIAES)
Smart grids are one of the last decades' innovations in electrical energy. They bring relevant advantages compared to the traditional grid and significant interest from the research community. Assessing the field's evolution is essential to propose guidelines for facing new and future smart grid challenges. In addition, knowing the main technologies involved in the deployment of smart grids (SGs) is important to highlight possible shortcomings that can be mitigated by developing new tools. This paper contributes to the research trends mentioned above by focusing on two objectives. First, a bibliometric analysis is presented to give an overview of the current research level about smart grid deployment. Second, a survey of the main technological approaches used for smart grid implementation and their contributions is highlighted. To that effect, we searched the Web of Science (WoS) and the Scopus databases. We obtained 5,663 documents from WoS and 7,215 from Scopus on smart grid implementation or deployment. With the extraction limitation in the Scopus database, 5,872 of the 7,215 documents were extracted using a multi-step process. These two datasets have been analyzed using a bibliometric tool called bibliometrix. The main outputs are presented with some recommendations for future research.
Use of analytical hierarchy process for selecting and prioritizing islanding ... (IJECEIAES)
One of the problems associated with power systems is the islanding condition, which must be rapidly and properly detected to prevent any negative consequences on the system's protection, stability, and security. This paper offers a thorough overview of several islanding detection strategies, which are divided into two categories: classic approaches, including local and remote approaches, and modern techniques, including techniques based on signal processing and computational intelligence. Additionally, each approach is compared and assessed based on several factors, including implementation costs, non-detected zones, declining power quality, and response times, using the analytical hierarchy process (AHP). The multi-criteria decision-making analysis yields overall weights of 24.7% for passive methods, 7.8% for active methods, 5.6% for hybrid methods, 14.5% for remote methods, 26.6% for signal processing-based methods, and 20.8% for computational intelligence-based methods when all criteria are compared together. Thus, it can be seen from the total weights that hybrid approaches are the least suitable choice, while signal processing-based methods are the most appropriate islanding detection method to select and implement in power systems with respect to the aforementioned factors. Using Expert Choice software, the proposed hierarchy model is studied and examined.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi... (IJECEIAES)
The power generated by photovoltaic (PV) systems is influenced by environmental factors. This variability hampers the control and utilization of solar cells' peak output. In this study, a single-stage grid-connected PV system is designed to enhance power quality. Our approach employs fuzzy logic in the direct power control (DPC) of a three-phase voltage source inverter (VSI), enabling seamless integration of the PV connected to the grid. Additionally, a fuzzy logic-based maximum power point tracking (MPPT) controller is adopted, which outperforms traditional methods like incremental conductance (INC) in enhancing solar cell efficiency and minimizing the response time. Moreover, the inverter's real-time active and reactive power is directly managed to achieve a unity power factor (UPF). The system's performance is assessed through MATLAB/Simulink implementation, showing marked improvement over conventional methods, particularly in steady-state and varying weather conditions. For solar irradiances of 500 and 1,000 W/m2, the results show that the proposed method reduces the total harmonic distortion (THD) of the injected current to the grid by approximately 46% and 38% compared to conventional methods, respectively. Furthermore, we compare the simulation results with IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b... (IJECEIAES)
Photovoltaic systems have emerged as a promising energy resource that caters to the future needs of society, owing to their renewable, inexhaustible, and cost-free nature. The power output of these systems relies on solar cell radiation and temperature. In order to mitigate the dependence on atmospheric conditions and enhance power tracking, a conventional approach has been improved by integrating various methods. To optimize the generation of electricity from solar systems, the maximum power point tracking (MPPT) technique is employed. To overcome limitations such as steady-state voltage oscillations and improve transient response, two traditional MPPT methods, namely fuzzy logic controller (FLC) and perturb and observe (P&O), have been modified. This research paper aims to simulate and validate the step size of the proposed modified P&O and FLC techniques within the MPPT algorithm using MATLAB/Simulink for efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ... (IJECEIAES)
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for robot hands is always an attractive topic in the research community. This is a challenging problem because robot manipulators are complex nonlinear systems and are often subject to fluctuations in loads and external disturbances. This article proposes an adaptive synchronous sliding control scheme to improve trajectory tracking performance for a robot manipulator. The proposed controller ensures that the positions of the joints track the desired trajectory, synchronizes the errors, and significantly reduces chattering. First, the synchronous tracking errors and synchronous sliding surfaces are presented. Second, the synchronous tracking error dynamics are determined. Third, a robust adaptive control law is designed, the unknown components of the model are estimated online by the neural network, and the parameters of the switching elements are selected by fuzzy logic. The built algorithm ensures that the tracking and approximation errors are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results. Simulation and experimental results show that the proposed controller is effective with small synchronous tracking errors, and the chattering phenomenon is significantly reduced.
Remote field-programmable gate array laboratory for signal acquisition and de... (IJECEIAES)
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students' learning experience anywhere and anytime in embedded system design. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting comprehensive debugging tools to resolve errors that require internal signal acquisition. This paper proposes a novel remote embedded-system design approach targeting FPGA technologies that is fully interactive via a web-based platform. Our solution provides FPGA board access and debugging capabilities beyond the visual feedback provided by existing remote laboratories. We implemented a lab module that users can seamlessly incorporate into their FPGA design. The module minimizes hardware resource utilization while enabling the acquisition of a large number of data samples from the signal during the experiments by adaptively compressing the signal prior to data transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis. This method allows users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m... (IJECEIAES)
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba... (IJECEIAES)
Rapidly and remotely monitoring the status parameters of solar cell systems, namely solar irradiance, temperature, and humidity, is a critical issue in enhancing their efficiency. Hence, in the present article an improved smart prototype of an internet of things (IoT) technique based on an embedded system through the NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions in Egypt, the cities of Luxor, Cairo, and El-Beheira, were chosen to study their solar irradiance profile, temperature, and humidity with the proposed IoT system. The monitored data of solar irradiance, temperature, and humidity were visualized live on Ubidots through the hypertext transfer protocol (HTTP). The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system make it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly recommend Luxor and Cairo as suitable places to build a solar cell system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int... (IJECEIAES)
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions (Victor Morales)
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Introduction – e-waste – definition – sources of e-waste – hazardous substances in e-waste – effects of e-waste on environment and human health – need for e-waste management – e-waste handling rules – waste minimization techniques for managing e-waste – recycling of e-waste – disposal treatment methods of e-waste – mechanism of extraction of precious metal from leaching solution – global scenario of e-waste – e-waste in India – case studies.
Understanding Inductive Bias in Machine Learning (SUTEJAS)
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
International Conference on NLP, Artificial Intelligence, Machine Learning an... (gerogepatton)
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
A review on techniques and modelling methodologies used for checking electrom... (nooriasukmaningtyas)
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
Literature Review Basics and Understanding Reference Management.pptx (Dr Ramhari Poudyal)
A three-day training on academic research, focused on analytical tools, at United Technical College, supported by the University Grant Commission, Nepal, 24-26 May 2024.
In contrast, to address these shortcomings, Wikipedia is an outstanding resource for the text semantic similarity problem. It is a large-scale, collaborative, open encyclopedia that has evolved into a comprehensive resource with very good coverage of diverse topics, important entities, and events; it widely covers named entities, domain-specific entities, and new entities. The English Wikipedia currently contains over 4 million articles (including redirection articles). WikiRelate [7] was the first work to compute measures of semantic relatedness using Wikipedia; this approach took the familiar techniques used for semantic relatedness over WordNet, such as the path-length measure [8], and adapted them to Wikipedia, but in general the results were similar. Gabrilovich and Markovitch (2007) [5] then proposed a new approach, Explicit Semantic Analysis (ESA), that achieves highly accurate results and has been extensively studied in many applications [9]. ESA uses Wikipedia as a semantic interpreter: it builds a weighted inverted index that maps each term into the list of Wikipedia articles in which it appears, and it computes the similarity between the vectors generated from two terms or texts. This means the inverted vector may contain millions of columns, most of them zero, given the sheer size of Wikipedia (more than 4 million concepts). Accordingly, interpreting text against all Wikipedia concepts is expensive, and computing the semantic relatedness between such huge vectors with cosine similarity slows the whole ESA process down.
Several related papers have addressed this problem. Economy-ESA [10] is an economic schema of explicit semantic analysis that reduces the dimension of the ESA index matrix using random selection, k-means, and norm-based clustering approaches. The authors in [11] propose a novel graph-based relatedness assessment method using Wikipedia features to avoid these drawbacks; it introduces a Naive-ESA algorithm that returns the top k most relevant Wikipedia concepts in order to reduce the dimensional space of Explicit Semantic Analysis (ESA). An efficient and effective algorithm was proposed in [12] that represents the meaning of a text by the concepts that best match it: it first computes the approximate top-k Wikipedia concepts most relevant to the given text and then leverages these concepts to represent its meaning. Following the above-mentioned studies, in this paper we present a new method, Optimize-ESA, that optimizes the ESA approach and resolves some of its limitations and drawbacks. Optimize-ESA reduces the dimension at the interpretation stage by computing the semantic similarity within a specific domain.
Moreover, several works [13] show that using a domain knowledge base is more beneficial and performant in the semantic similarity computation process [14]. This result has pushed many researchers to use a domain knowledge base when the domain of the input text is already known. The majority of work on semantic similarity in a specific domain targets the biomedical domain, because of the proliferation of textual resources and the importance of its terminology. In this context, the state-of-the-art methods for calculating semantic relatedness in a specific domain can be roughly divided into two main groups: those concentrated on ontology-based methods [15], and distributional methods that use a domain-specific corpus [16]. There have been many attempts to use Wikipedia to compute semantic similarity in a specific domain. The authors of [17] assess the suitability of Wikipedia in the biomedical domain as a potential knowledge resource for semantic relatedness computation by comparing it with other methods (ontology-based and distributional methods), while Jaiswal [18] proposes a method for calculating the semantic relatedness of text related to diseases, conditions, and wellness issues that uses ESA with MedlinePlus as its knowledge base instead of Wikipedia.
In this paper, we propose Optimize-ESA, an approach that improves on ESA and provides significant gains in execution time and space consumption without causing a significant reduction in precision. In our approach we limit the interpretation to the top k concepts, selected from the Wikipedia category tree according to the input domain. We then leverage these concept vectors to map a text from the keyword space into the optimized concept space. All evaluations are performed on datasets containing pairs of terms from the biomedical domain together with a gold standard semantic similarity value for each pair. The results are compared with those of the ESA approach and other state-of-the-art semantic similarity approaches. The remainder of this paper is organized as follows. Section 2 presents our method Optimize-ESA and its architecture, Section 3 details the experiments that evaluate the effectiveness of our method and reports the analysis of the results in the biomedical domain, and Section 4 concludes and presents some perspectives for future research.
2. PROPOSED APPROACH: OPTIMIZE-ESA FOR SEMANTIC SIMILARITY MEASURES
2.1. The Wikipedia features
Wikipedia is a free, web-based, collaborative, multilingual online encyclopedia founded in 2001 and editable by its users. It has undergone tremendous growth, currently comprising more than 2,382,000 articles in about 250 languages, and has become one of the most important information resources on the web.
Wikipedia content is presented on pages:
Articles: the normal pages in Wikipedia that contain encyclopedic information. Each article describes a single concept or topic, with a concise title that can be used in ontologies and a brief overview of the topic. There is only one article per concept or topic.
Redirects: a redirect is a Wikipedia page that automatically sends users to another page (connecting articles to other articles or to sections of an article). It is possible to redirect to just a specific section of the target page.
Disambiguation pages: disambiguation is the process of resolving conflicts when an article title is ambiguous; such a page contains a list of articles corresponding to the different meanings of the same word. For example, the word "Java" can refer to an island of Indonesia, a programming language, a French band, and many other things.
Categories: categories are nodes for the hierarchical organization of articles, intended to group pages on similar subjects; almost all Wikipedia articles belong to one or more categories. The Wikipedia category system is organized as a network, which we present briefly in section 2.3.1.
2.2. ESA approach
Explicit Semantic Analysis (ESA) was created by Gabrilovich and Markovitch [19]. The approach represents texts as a weighted mixture of a set of Wikipedia concepts, where each concept is the title of a Wikipedia page. The main advantage of this approach is its use of a vast amount of human knowledge. The first step is to construct the semantic interpreter that maps fragments of natural language text into a weighted sequence of Wikipedia concepts ordered by their relevance to the input. Given an input text fragment T composed of words, T = {w_i}, we first represent it using the TF-IDF scheme, where v_i is the weight of word w_i. Then, we use Wikipedia articles as index documents: each Wikipedia concept is represented as a vector of the words that occur in the corresponding article, with entries weighted using the TF-IDF scheme. These weights quantify the strength of association between words and concepts. We build an inverted index that maps each word into the list of concepts in which it appears. Let k_j be the inverted index entry of word w_i for Wikipedia concept c_j, with c_j ∈ {c_1, ..., c_N} (where N denotes the total number of Wikipedia concepts); k_j quantifies the strength of association of word w_i with concept c_j. The semantic interpretation vector V for text T is then a vector of length N in which the weight of each concept c_j is defined as $\sum_{w_i \in T} v_i \cdot k_j$. Entries of this vector reflect the relevance of the corresponding concepts to text T. Finally, ESA uses the cosine metric to compute the semantic relatedness of a pair of text fragments by comparing their interpretation vectors. Figure 1 below presents the whole ESA process.
Figure 1. The explicit semantic analysis (ESA) system
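To make the interpretation step concrete, the following is a minimal, self-contained Python sketch of the process Figure 1 depicts, in which a toy dictionary of concept articles stands in for the Wikipedia index; the function names (build_inverted_index, interpret, relatedness) are illustrative assumptions, not the paper's implementation.

import math
from collections import Counter, defaultdict

def build_inverted_index(concepts):
    # concepts: dict mapping a concept (article title) to its article text.
    # Returns word -> {concept: TF-IDF weight}, i.e. the ESA inverted index.
    doc_tokens = {c: text.lower().split() for c, text in concepts.items()}
    n_docs = len(doc_tokens)
    df = Counter(w for toks in doc_tokens.values() for w in set(toks))
    index = defaultdict(dict)
    for c, toks in doc_tokens.items():
        for w, f in Counter(toks).items():
            # strength of association k_j between word w and concept c
            index[w][c] = f * math.log(n_docs / df[w])
    return index

def interpret(text, index):
    # Maps a text fragment T into the concept space: the weight of concept
    # c_j is the sum over words w_i in T of v_i * k_j (v_i = term frequency).
    v = Counter(text.lower().split())
    vec = defaultdict(float)
    for w, vi in v.items():
        for c, kj in index.get(w, {}).items():
            vec[c] += vi * kj
    return vec

def relatedness(t1, t2, index):
    # Cosine similarity between the two interpretation vectors.
    a, b = interpret(t1, index), interpret(t2, index)
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

In the full ESA setting the index dictionary would span all of Wikipedia, which is exactly the cost the next paragraph describes.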
The ESA approach is simple and effective; however, the process is too expensive, for several reasons. First, the dimension of the concept vector for a given word is very large, because its length equals the number of concepts in Wikipedia, and Wikipedia is huge (more than 4 million concepts). Second, to produce this concept vector, the overall index matrix must be multiplied by a term vector, a large product that requires numerous multiplications. Third, the vector of a word consists mostly of zeros, because a word appears in only a few Wikipedia articles; reinterpreting text over all Wikipedia concepts is therefore very expensive and slow, since much time is lost in unnecessary operations on the zero entries of high-dimensional sparse vectors. Finally, computing the similarity or relatedness between two vectors with so many dimensions is very costly. Because of these problems, we propose in this paper an approach that optimizes ESA by returning not the vector space over all concepts in Wikipedia but only the top k most relevant concepts. Given a specific domain, we select the Wikipedia articles most relevant to domain Di based on the Wikipedia category network. Furthermore, we create a domain index Ui that saves the inverted index of the Wikipedia articles of each domain, computed once the domain Di is entered. For each text T in a specific domain Di, we semantically reinterpret it based on the k concepts saved in the domain index Ui, and we update this domain index according to the Wikipedia update frequency. We present the Optimize-ESA approach in the section below.
2.3. Optimize-ESA approach
In this paper, we propose an approach to compute semantic similarity in a specific domain, called Optimize-ESA. This approach resolves some of the shortcomings of ESA and optimizes it in terms of space consumption and similarity computation time. The architecture of our approach, presented in Figure 2, consists of two layers: filtering the k concepts for a domain Di, and building a domain inverted index.
Figure 2. Optimize-ESA architecture
2.3.1. First layer: filter k concepts for domain Di
The relationship between a concept (article) and a category in Wikipedia is expressed by a link called a category link (the English version contained 49.98 million such links in September 2006 [20]). The Wikipedia category system is socially created and edited: any user can create an article and classify it into a category. This leads to a tremendous growth of articles and categories in Wikipedia (more than 500,000 categories in the English Wikipedia [20]). Consequently, Wikipedia editors try to better organize the category structure by purifying certain concepts and splitting categories into multiple fine-grained categories (the number of categories in wiki-14 was 25% higher than in wiki-12). Furthermore, the category system in Wikipedia is represented as a directed graph where nodes represent pages or categories and edges represent the oriented relationship "is assigned to". Every category has multiple parent and child categories, and each category is connected to a number of articles (the categories cover all Wikipedia articles). Besides, the category system in Wikipedia has a taxonomy structure, a hierarchy of topics and subtopics, as shown in Figure 3. It enables us to search articles by narrowing from broader categories down to narrower ones. Wikipedia offers a category tree system [21] that enables users to browse categories, but not all the concepts belonging to a specific category, because the system is not strictly a tree structure. Nevertheless, starting from a category we can traverse the descendant categories and detect all connected articles.
Figure 3. The Wikipedia category tree
In this part, we use the Wikipedia category system to extract the articles (concepts) related to an input domain D. Using this category system, we can treat the input domain as a category in Wikipedia, search for all the categories belonging to it, and, by traversing the descendant categories, extract all connected articles. However, as the traversal level increases, the set of covered articles grows until almost all Wikipedia articles are covered; that would mean every article belongs to every broad category, which is incorrect. Our issue is therefore to define at which level of the breadth-first traversal we need to stop, in other words, down to which level of the Wikipedia tree structure the categories are effectively related to the input category. We therefore compute the semantic similarity between the input category and all categories at each level, and decide after experimentation at which level to stop. Table 1 below presents the results of this experimentation.
Based on several experiments and observations, we find that the deepest category level still effectively related to the input domain changes from one domain to another, so it is not always correct to stop at a fixed category level (level 8 for computer science but level 7 for bioinformatics); it depends on the number of descendant categories the domain has in the Wikipedia category system. Therefore, the categories extracted must be selected by a semantic similarity measure between the input domain and the categories at each level. After experimentation, we decided to stop the extraction of subcategories related to the input domain below a similarity value of 0.4. Figure 4 presents the whole process of detecting the Wikipedia articles related to a specific input domain; a sketch of this traversal follows Table 1 and Figure 4.
Table 1. Correlation of semantic similarity between the input category and the Wikipedia category tree levels

Category input      Level 1  Level 2  Level 3  Level 4  Level 5  Level 6  Level 7  Level 8  Level 9  Level 10  Level 11  Level 12
Computer science    0.8389   0.6858   0.6127   0.587    0.557    0.517    0.489    0.435    0.387    0.345     0.287     0.198
Bioinformatics      0.629    0.5058   0.502    0.497    0.476    0.456    0.436    0.427    0.357    0.245     0.227     0.175
Biology             0.7205   0.6063   0.598    0.576    0.554    0.518    0.486    0.423    0.397    0.297     0.267     0.109
Medicine            0.7379   0.64498  0.5870   0.5630   0.5563   0.536    0.501    0.456    0.345    0.329     0.234     0.206
Figure 4. The process of detecting Wikipedia articles related to the input domain
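The traversal itself can be sketched as follows, assuming the category graph is given as parent-to-children and category-to-articles mappings and that similarity(a, b) is any measure of the kind discussed above; the data structures and helper names are illustrative assumptions, with only the 0.4 threshold taken from our experimentation.

from collections import deque

SIM_THRESHOLD = 0.4  # stop expanding categories less similar than this (from Table 1 analysis)

def filter_domain_concepts(domain, subcats, articles, similarity):
    # Breadth-first traversal of the Wikipedia category tree rooted at the
    # input domain, keeping the articles of every category whose similarity
    # to the domain stays above the threshold.
    kept, seen = set(), {domain}
    queue = deque([domain])
    while queue:
        cat = queue.popleft()
        kept.update(articles.get(cat, ()))   # articles connected to this category
        for child in subcats.get(cat, ()):
            if child in seen:
                continue                     # the category graph is not a strict tree
            seen.add(child)
            if similarity(domain, child) >= SIM_THRESHOLD:
                queue.append(child)
    return kept

Because expansion is gated by similarity rather than by a fixed depth, the traversal stops at level 8 for a domain like computer science but at level 7 for bioinformatics, matching the observation above.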
2.3.2. Second layer: build domain index Ui
After filtering the Wikipedia articles related to a specific domain Di, we build an inverted index for the domain that maps each word into the list of concepts in which it appears, as presented in section 2.2. Let k_j be the inverted index entry for word w_i, where k_j quantifies the strength of association of word w_i with Wikipedia concept c_j, c_j ∈ {c_1, ..., c_n}, where n denotes the number of Wikipedia concepts filtered for domain Di, as shown in Table 2.
Table 2. Term weights over the Wikipedia articles filtered for domain Di

          WA1      .....    WAj
Term 1    T[0,0]   .....    .....
.....     .....    .....    .....
Term k    .....    .....    T[i,j]

(rows: terms occurring in the Wikipedia articles filtered for domain Di; columns: the filtered articles WA1 ... WAj)
After building the weighted inverted index for domain Di, we store it in a database as Ui and reuse it for any future interpretation, which optimizes the computation of semantic similarity. The database must be updated to take in new articles added to Wikipedia. The algorithm of our method is presented below, as a runnable Python version of the original pseudocode:
# The algorithm creates the inverted Wikipedia index for a specific domain,
# to be used in the ESA-based semantic similarity measure between texts.
# It reuses build_inverted_index and filter_domain_concepts from the
# illustrative sketches above.
# Input:  domain Di
# Output: domain index Ui

U = {}  # persistent store of domain indexes: domain name -> Ui

def domain_index(di, subcats, articles, concept_texts, similarity):
    # Step 1: if Di already exists in U, return its stored index.
    if di in U:
        return U[di]
    # Otherwise, search Di in the Wikipedia category tree and extract
    # the k concepts Ck = [C1, ..., Ck] belonging to Di.
    ck = filter_domain_concepts(di, subcats, articles, similarity)
    # Step 2: build the inverted index WDi over the filtered concepts only.
    wdi = build_inverted_index({c: concept_texts[c] for c in ck})
    # Store WDi in Ui for future interpretations, then return it.
    U[di] = wdi
    return wdi
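Under these illustrative assumptions, the index for a domain would be built once, for example with ui = domain_index("Medicine", subcats, articles, concept_texts, similarity), and then reused for every subsequent interpretation until the next Wikipedia refresh.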
Furthermore, to compute the semantic similarity between two texts T1 and T2, we consider each as a bag of words T = {t1, t2, ..., tn} with n words. We semantically reinterpret each text based on the k concepts saved in the domain index Ui, and finally compute the semantic similarity between the two text vectors with the cosine similarity metric.
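Combining the sketches above, the whole Optimize-ESA similarity step reduces to a few lines; again this is an illustrative sketch under the same assumptions, not the paper's implementation.

def optimize_esa_similarity(t1, t2, di, subcats, articles, concept_texts, similarity):
    # Interpret both texts against the small domain index Ui only, instead of
    # the full Wikipedia index; this restriction is where the savings come from.
    ui = domain_index(di, subcats, articles, concept_texts, similarity)
    return relatedness(t1, t2, ui)  # cosine over the domain-restricted vectors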
3. RESULTS AND DISCUSSION
3.1. Case study: biomedical domain
In recent years, the amount of information available in textual format has been increasing rapidly in the biomedical domain, for example in patient health records and medical documents. Measures of semantic relatedness between concepts and texts are therefore widely used in this domain: discovering similar diseases [22], redundancy detection in clinical records [23], comparing gene products [24], identifying direct and indirect protein interactions within human regulatory pathways using the gene ontology [25], and coding medical diagnoses and adverse drug reactions using semantic distance [26]. The classical semantic similarity measures have been adapted for use in several domains, but they are less efficient there due to their limited coverage of specialized vocabularies; hence the need for a specialized knowledge base, as in the biomedical domain, exploiting medical ontologies, knowledge repositories, and biomedical structured vocabularies. For this reason, we propose in this paper a domain-specialized method that optimizes the ESA semantic similarity approach. We chose to test the performance of our method on three biomedical datasets because of the availability and proliferation of the resources. We present in the section below the datasets used in our experimentation and the interpretation of our results.
3.2. Experimentation
Humans have an innate ability to judge the semantic relatedness of texts. Accordingly, to evaluate the performance of machine measurements of semantic similarity between texts, we compare them with human ratings in the same setting, by computing the correlation between human judgements and machine calculations. In this work, given the scarcity of suitable datasets of biomedical sentence pairs, we use the datasets listed in Table 3. The first is the BIOSSES dataset [27], a benchmark dataset for biomedical sentence similarity estimation. It contains 100 sentence pairs selected from the TAC (Text Analysis Conference) biomedical summarization track training set, which contains articles from the biomedical domain. The sentence pairs were evaluated by five different human experts who gave scores ranging from 0 (no relation) to 4 (equivalent), averaged for each pair to produce a single relatedness score. We also test our method on two French web corpora [28]: the first corpus is about "epidemics" and the second one is about "space conquest". Each corpus contains reference sentences, and each of them is associated with six sentences chosen with similarity scores ranging from 0 (the sentences are unrelated) to 4 (the sentences are completely equivalent).
Table 3. The datasets used in the semantic similarity task

Dataset          Pairs   Scale   References
BIOSSES          100     1-5     [27]
Epidemics        60      0-4     [28]
Space conquest   60      0-4     [28]
Following the literature on semantic relatedness, we evaluate performance by measuring the correlation between the scores assigned by the proposed method and the human judgement scores; for each dataset we report the correlation computed on all pairs using Pearson's correlation coefficient. The Pearson correlation metric, denoted P, reflects the linear correlation between the measured results and the human judgments, where 0 means uncorrelated and 1 means perfectly correlated. The corresponding formula is defined as:

P = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

where x_i refers to the value of the i-th pair in the dataset given by human judgments, y_i to the corresponding value returned by the Optimize-ESA method, \bar{x} and \bar{y} to their respective means, and n to the length of the target dataset.
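As a quick sanity check of the metric, the formula above can be evaluated directly; the gold and predicted scores below are made-up toy values, not results from the paper.

import math

def pearson(x, y):
    # Pearson's correlation coefficient P between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gold = [4.0, 3.2, 1.0, 0.5, 2.8]       # averaged human judgements (toy values)
pred = [0.91, 0.74, 0.22, 0.18, 0.66]  # method's similarity scores (toy values)
print(round(pearson(gold, pred), 3))   # about 0.997 on this toy data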
Table 4 shows the Pearson correlation coefficients obtained by the ESA algorithm and by our method Optimize-ESA on the three datasets BIOSSES, Epidemics, and Space Conquest. Optimize-ESA achieves a correlation of 0.612, compared to 0.595 for the ESA method, on the BIOSSES sentence dataset. On the Epidemics dataset, our method reaches a correlation of 0.544, compared to 0.525 for the full version of ESA. On the Space Conquest dataset, the ESA approach with the Wikipedia knowledge base obtains a correlation of 0.558, compared to 0.571 for our method. This clearly shows that our method correlates better with human judgement than the full version of ESA. A comparison of our method Optimize-ESA with some state-of-the-art methods for computing semantic relatedness in the biomedical domain is shown in Table 5. We compare it with Resnik and Lin, the most popular information content measures among knowledge-based methods, and with Levenshtein, a string-based measure, besides comparing Optimize-ESA with the traditional ESA approach using Wikipedia as its knowledge base.
Table 4. Comparison of Pearson's correlation coefficients on the BIOSSES, Epidemics, and Space Conquest datasets

Dataset          ESA algorithm (Gabrilovich & Markovitch, 2007)   Optimize-ESA
                 Pearson's (P)                                    Pearson's (P)
BIOSSES          0.595                                            0.612
Epidemics        0.525                                            0.544
Space conquest   0.558                                            0.571
Table 5. Pearson correlation coefficients (P) of related studies

Related studies              BIOSSES   Epidemics   Space Conquest   References
IC-based measures
  Resnik                     0.473     0.396       0.412            P. Resnik [29]
  Lin                        0.645     0.591       0.611            D. Lin [30]
String similarity measures
  Levenshtein                0.592     0.601       0.591            Finkelstein et al. [31]
ESA similarity measures
  ESA-wiki                   0.595     0.525       0.558            Gabrilovich and Markovitch [19]
  Optimize-ESA               0.612     0.544       0.571
The results in Table 5 indicate that Optimize-ESA obtains competitive Pearson correlations, especially on the smaller datasets. On the larger dataset, both the full version of ESA, which includes all Wikipedia concepts, and the domain-specific Optimize-ESA outperform the string similarity and IC-based measures. Furthermore, we observed that Optimize-ESA is faster than ESA over the full Wikipedia, as shown by the experiment presented in Figure 5: we measured the cosine similarity processing cost of six pairs from each test collection and computed the running time comparison between ESA and Optimize-ESA.
Figure 5. ESA and Optimize-ESA running time
4. CONCLUSION AND FUTURE WORK
The study of semantic similarity between words has long been an integral part of information retrieval and natural language processing. Based on the theoretical principles and the way in which ontologies are used to compute similarity, different kinds of methods can be identified according to the type, size and domain of the dataset. Among these methods, the Explicit Semantic Analysis (ESA) approach with the Wikipedia knowledge base performs the task of computing the semantic relatedness of words and text fragments very well. However, the ESA process is too expensive because the concept vector of a given word spans all Wikipedia concepts (about 4 million dimensions), so the efficiency of ESA suffers from the time lost in unnecessary operations.
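To make the cost argument concrete, the toy Python sketch below (our own illustration, with made-up dimensions and a hypothetical domain subset, not the paper's implementation) contrasts the full interpretation step, which multiplies the whole index matrix by a term vector, with the domain-restricted variant proposed in the next paragraph:

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the Wikipedia TF-IDF index: rows = concepts, cols = terms.
# The real index has ~4M concept rows; these tiny numbers are for illustration.
n_concepts, n_terms = 1000, 500
index = sparse.random(n_concepts, n_terms, density=0.01,
                      format="csr", random_state=0)

def interpret(term_vector, concept_rows=None):
    """Map a TF-IDF term vector to an ESA concept vector.

    concept_rows, if given, restricts interpretation to a
    domain-specific subset of concepts (the dimension-reduction idea).
    """
    matrix = index if concept_rows is None else index[concept_rows, :]
    return matrix @ term_vector

term_vec = np.zeros(n_terms)
term_vec[[3, 42, 99]] = 1.0              # toy query terms
biomedical_rows = np.arange(100)         # hypothetical domain subset

full = interpret(term_vec)                      # 1000-dimensional vector
reduced = interpret(term_vec, biomedical_rows)  # 100-dimensional vector
print(full.shape, reduced.shape)
```

Restricting the concept rows before the multiplication shrinks both the matrix-vector product and every subsequent cosine computation.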
We propose in this paper a new method called Optimize-ESA, which reduces the dimensionality at the interpretation stage by computing the semantic similarity within a specific domain. To evaluate the performance of our method, we compare different algorithms for semantic relatedness in the biomedical domain. We chose the biomedical domain because the availability of ontologies and methods there is significantly higher than in any other domain. We conclude that our method outperforms the current state-of-the-art methods for calculating the semantic relatedness of biomedical texts, as it correlates much better with human judgements. Two further lines of future research relate to the method presented in this work. Firstly, we plan to optimize our method further by filtering the Wikipedia concepts using a domain-specific knowledge base leveraged with the Wikipedia category tree. Secondly, we plan to improve the results of ESA by adding a category index to the weighted inverted index. Finally, a wider evaluation would be desirable, considering larger sets of text pairs as benchmark data in other domains.
REFERENCES
[1] S. Tongphu, "Toward Semantic Similarity Measure Between Concepts in An Ontology," Indonesian Journal of
Electrical Engineering and Computer Science, vol. 14, no. 3, pp. 1356-1372, 2019.
[2] Bazi and N. Laachfoubi, "Arabic Named Entity Recognition using Deep Learning Approach," International
Journal of Electrical and Computer Engineering (IJECE), vol. 9, no. 3, pp. 2025-2032, 2019.
[3] Y. Zhang, R. Jin and Z. Zhou, "Understanding Bag-of-Words Model: A Statistical Framework", International
Journal of Machine Learning and Cybernetics, vol. 1, no. 1-4, pp. 43-52, 2010.
[4] T. Landauer, P. Foltz and D. Laham, "An Introduction to Latent Semantic Analysis," Discourse Processes, vol. 25,
no. 2-3, pp. 259-284, 1998.
[5] E. Gabrilovich and S. Markovitch, "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge," in AAAI'06, pp. 1301–1306, 2006.
[6] A. Budanitsky and G. Hirst, "Evaluating WordNet-based Measures of Lexical Semantic Relatedness,"
Computational Linguistics, vol. 32, no. 1, pp. 13-47, 2006.
[7] M. Strube and S. P. Ponzetto, "WikiRelate! Computing Semantic Relatedness using Wikipedia," Proceedings of the 21st National Conference on Artificial Intelligence, Boston, Massachusetts, pp. 1419-1424, 2006.
[8] Nurifan, R. Sarno and C. Wahyuni, "Developing Corpora using Wikipedia and Word2vec for Word Sense
Disambiguation," Indonesian Journal of Electrical Engineering and Computer Science, vol. 12, no. 3,
pp. 1239-1246, 2018.
[9] K. Dramé, F. Mougin and G. Diallo, "Large Scale Biomedical Texts Classification: A kNN and An ESA-Based
Approaches," Journal of Biomedical Semantics, vol. 7, no. 1, 2016.
[10] F. Rahutomo, Y. Manabe, T. Kitasuka and M. Aritsugi, "Econo-ESA Reduction Scheme and the Impact of its Index
Matrix Density," Procedia Computer Science, vol. 35, pp. 474-483, 2014.
[11] P. Li, B. Xiao, W. Ma, Y. Jiang and Z. Zhang, "A Graph-Based Semantic Relatedness Assessment Method
Combining Wikipedia Features," Engineering Applications of Artificial Intelligence, vol. 65, pp. 268-281, 2017.
[12] J. W. Kim, A. Kashyap, and S. Bhamidipati, "Wikipedia-Based Semantic Interpreter using Approximate Top-K Processing and Its Application," vol. 18, pp. 650-675, 2012.
[13] X. Song, "Ontology-based Domain-specific Semantic Similarity Analysis and Applications," Computer Science,
All Dissertations, 2018.
[14] T. Gottron, M. Anderka and B. Stein, “Insights into Explicit Semantic Analysis,” In Proceedings of the 20th ACM
International Conference on Information and Knowledge Management (CIKM), pp. 1961-1964, 2011.
[15] V. Garla and C. Brandt, "Semantic Similarity in the Biomedical Domain: An Evaluation Across Knowledge
Sources," BMC Bioinformatics, vol. 13, no. 1, 2012.
[16] D. Sánchez, M. Batet, and A. Valls, “Web-Based Semantic Similarity: an evaluation in the Biomedical Domain,”
Int J Softw Inform, vol. 4, pp. 39-52, 2010.
[17] E. Costa, H. Tjandrasa and S. Djanali, "Text Mining for Pest and Disease Identification on Rice Farming with
Interactive Text Messaging," International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 3,
pp. 1671, 2018.
[18] A. Jaiswal and A. Bhargava, "Explicit Semantic Analysis for Computing Semantic Relatedness of Biomedical Text," in Proceedings of Confluence: The Next Generation Information Technology Summit (IEEE), 2014.
[19] E. Gabrilovich and S. Markovitch, "Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis," in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1606-1611, 2007.
[20] R. B. Bairi, M. Carman, and G. Ramakrishnan, "On the Evolution of Wikipedia: Dynamics of Categories and
Articles," Ninth International AAAI Conference on Web and Social Media, pp. 6-10, 2015.
[21] "PetScan", Petscan.wmflabs.org, 2019. [Online]. Available: https://petscan.wmflabs.org/.
[Accessed: 31- May- 2019].
[22] S. Mathur and D. Dinakarpandian, "Finding Disease Similarity Based on Implicit Semantic Similarity,"
Journal of Biomedical Informatics, vol. 45, no. 2, pp. 363-371, 2012.
[23] R. Zhang, S. Pakhomov, B. T. McInnes, and G. B. Melton, "Evaluating Measures of Redundancy in Clinical Texts," AMIA Annu Symp Proc, pp. 1612-1620, 2011.
[24] C. Pesquita, D. Faria, A. Falcão, P. Lord and F. Couto, "Semantic Similarity in Biomedical Ontologies,"
PLoS Computational Biology, vol. 5, no. 7, 2009.
[25] X. Guo, R. Liu, C. Shriver, H. Hu and M. Liebman, "Assessing Semantic Similarity Measures for
the Characterization of Human Regulatory Pathways," Bioinformatics, vol. 22, no. 8, pp. 967-973, 2006.
[26] C. Bousquet, et al., "Appraisal of the MedDRA Conceptual Structure for Describing and Grouping Adverse Drug
Reactions," Drug Safety, vol. 28, no. 1, pp. 19-34, 2005.
[27] G. Soğancıoğlu, H. Öztürk and A. Özgür, "BIOSSES: A Semantic Sentence Similarity Estimation System for
the Biomedical Domain," Bioinformatics, vol. 33, no. 14, pp. i49-i58, 2017.
[28] H. H. Vu, J. Villaneau, F. Saïd, and P. F. Marteau, "Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-Grams," in Proceedings of the 17th International Conference on Text, Speech and Dialogue (TSD 2014), Brno, Czech Republic, Springer International Publishing, pp. 201–208, 2014.
[29] P. Resnik, "Using Information Content to Evaluate Semantic Similarity in a Taxonomy," in Proceedings of IJCAI, pp. 448–453, 1995.
[30] D. Lin, "An Information-Theoretic Definition of Word Similarity," in ICML'98, 1998.
[31] L. Finkelstein, et al., “Placing Search in Context: The Concept Revisited,” ACM Transactions on Information
Systems, vol. 20, pp. 116–131, 2002.
BIOGRAPHIES OF AUTHORS
Khaoula Mrhar is currently a Ph.D. student in the IPSS research team at Mohammed V University, Rabat, Morocco. Her background includes a degree in mathematics and computer science. Her research interests include formal and non-formal learning, artificial intelligence, data integration, text mining, and recommender systems.
Mounia Abik received a PhD from the National High School for Computer Science and Systems Analysis (ENSIAS) in 2009 and a Habilitation to Direct Research (HDR) from Mohammed V University of Rabat in 2014. Her main research interests focus on e-Learning, Knowledge Extraction from Social Networks, Semantic Web, and Cyber-violence.