In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as the similarity measure. TDC is tackled by dividing the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas of the search space and search deeply within each area using a particular learning mechanism. The proposed algorithm, called MVOTDC, adapts the convergence behaviour of the MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC produces significant results in comparison with three well-established methods.
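For illustration, the Euclidean-distance clustering objective described above can be sketched as follows; the function and variable names are assumptions for this sketch, not taken from the paper:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tdc_objective(docs, centroids, labels):
    """Sum of distances from each document to its cluster centroid;
    lower values mean tighter (more similar) clusters."""
    return sum(euclidean(d, centroids[k]) for d, k in zip(docs, labels))

# Two tight clusters give a small objective value; a bad assignment inflates it.
docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centroids = [(0.05, 0.0), (5.05, 5.0)]
good = tdc_objective(docs, centroids, [0, 0, 1, 1])   # 0.2
bad = tdc_objective(docs, centroids, [1, 1, 0, 0])    # much larger
```

A discrete optimizer such as MVOTDC would search over the label assignments to minimize this kind of objective.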
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training
data. The semi-supervised classification technique uses labeled and unlabeled data for training. This
technique has been shown to be effective in some cases; however, the use of unlabeled data is not always
beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of
ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency
of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for
text. Our algorithm enhances the performance of transductive support vector machines through the use of
ontologies. We report experimental results applying our algorithm to three different datasets. Our
experiments show an accuracy improvement of 4% on average, and up to 20%, in comparison with the
traditional semi-supervised model.
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment, among many others.
Textual Data Partitioning with Relationship and Discriminative Analysis (IJMTER)
Data partitioning methods group data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a flat, grid-like format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) as input for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is used.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions.
A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it exploits both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability by using labels and concept relations in the dimensionality reduction process.
This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four clustering algorithms: K-Means, K-Medoids, Farthest First Clustering (hierarchical algorithms), and DBSCAN (non-hierarchical algorithm). It describes the methodology of each algorithm and provides pseudocode. It also describes the datasets used to evaluate the performance of the algorithms and the evaluation metrics. The goal is to compare the performance of the clustering methods on different datasets.
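For instance, the K-Means methodology summarized above can be sketched as follows (a minimal illustrative implementation, seeding centroids from the first k points; not the pseudocode from the document):

```python
import math

def kmeans(points, k, iters=20):
    """Minimal K-Means: seed centroids from the first k points, then
    alternate an assignment step and a centroid-update step."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            d = [math.dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: each centroid moves to the mean of its points.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centroids, clusters

centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

K-Medoids follows the same loop but restricts each centre to an actual data point, which makes it less sensitive to outliers.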
Improved wolf algorithm on document images detection using optimum mean techn... (journalBEEI)
Detecting text in handwritten historical documents provides high-level features for the challenging problem of handwriting recognition. Such handwriting often contains noise, faint or incomplete strokes, strokes with gaps, and competing lines when embedded in a table or form, making it unsuitable for local line-following algorithms or associated binarization schemes. In this paper, a method based on an optimum threshold value, named the Optimum Mean method, is presented. The Wolf method fails to detect thin text in non-uniform input images; the proposed method overcomes this problem by deriving a maximum threshold value from the optimum mean. In the evaluation, the proposed method obtained a higher F-measure (74.53) and PSNR (14.77) and the lowest NRM (0.11) compared with the Wolf method. In conclusion, the proposed method is successful and effective at solving the Wolf method's problem, producing a high-quality output image.
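As a rough illustration of threshold-based binarization, the sketch below uses the global mean intensity as the threshold; this is a hedged simplification, since the paper derives a more involved optimum mean-based threshold:

```python
def mean_threshold_binarize(pixels):
    """Binarize grayscale pixel intensities using the global mean as the
    threshold. Illustrative only: the paper's Optimum Mean method computes
    an optimized threshold rather than the plain mean used here."""
    t = sum(pixels) / len(pixels)
    # Dark strokes (at or below the threshold) map to 0; background to 255.
    return [0 if p <= t else 255 for p in pixels]

print(mean_threshold_binarize([10, 20, 200, 220]))  # → [0, 0, 255, 255]
```

On a non-uniform image, a single global threshold like this is exactly what fails for thin, faint strokes, which motivates the paper's adjusted threshold.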
Data reduction techniques for high dimensional biological data (eSAT Journals)
Abstract
High-dimensional biological datasets have been growing rapidly in recent years. Extracting knowledge from and analyzing high-dimensional biological data is one of the key challenges, in which variety and veracity are the two distinguishing characteristics. The questions that arise are how to perform dimensionality reduction for this heterogeneous data, how to develop a high-performance platform to efficiently analyze high-dimensional biological data, and how to find useful information in this data. To discuss this issue in depth, this paper begins with a brief introduction to data analytics for biological data, followed by a discussion of big data analytics and a survey of data reduction methods for biological data. We propose a dense clustering algorithm for standard high-dimensional biological data.
Keywords: Big Data Analytics, Dimensionality Reduction
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER (ijnlc)
This document presents an adaptive log file parser that uses semantics and hidden Markov models. It first clusters log file lines based on semantics to limit unstructured text. It then builds a hidden Markov model to represent parsing patterns, with log entries as states and extracted values as emissions. When applied to a new system, it adapts the model's transition and emission probabilities to fit the new data. The approach achieves over 99.99% accuracy when trained on one system and applied to another with slightly different log patterns.
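A hedged sketch of the model-fitting step, assuming simple count-based estimation of the transition and emission probabilities from (state, emission) pairs; the paper's actual adaptation procedure may differ:

```python
from collections import Counter, defaultdict

def estimate_hmm(sequences):
    """Estimate HMM transition and emission probabilities by counting.
    Each sequence is a list of (state, emission) pairs, e.g. a parsed
    log line with log-entry types as states and extracted values as
    emissions (an illustrative encoding, not the paper's exact one)."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in sequences:
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1][s2] += 1          # count state-to-state transitions
        for s, e in seq:
            emit[s][e] += 1             # count per-state emissions
    # Normalize the counts into probability distributions.
    trans_p = {s: {t: c / sum(cnt.values()) for t, c in cnt.items()}
               for s, cnt in trans.items()}
    emit_p = {s: {e: c / sum(cnt.values()) for e, c in cnt.items()}
              for s, cnt in emit.items()}
    return trans_p, emit_p

trans_p, emit_p = estimate_hmm(
    [[("START", "ts"), ("MSG", "text"), ("MSG", "text2")]])
```

Adapting the parser to a new system would then amount to re-running this estimation on the new system's log lines.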
This document summarizes a research paper that introduces a novel multi-viewpoint similarity measure for clustering text documents. The paper begins with background on commonly used similarity measures like Euclidean distance and cosine similarity. It then presents the novel multi-viewpoint measure, which considers multiple viewpoints (objects not assumed to be in the same cluster) rather than a single viewpoint. The paper proposes two new clustering criterion functions based on this measure and compares them to other algorithms on benchmark datasets. The goal is to develop a similarity measure and clustering methods that provide high-quality, consistent performance like k-means but can better handle sparse, high-dimensional text data.
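A minimal sketch of the multi-viewpoint idea, assuming the similarity of a document pair is averaged over difference vectors taken from viewpoint documents outside their cluster; this is an illustrative reading, not the paper's exact formula:

```python
def mv_similarity(di, dj, viewpoints):
    """Similarity of documents di and dj measured from multiple viewpoints:
    for each viewpoint dh (assumed not in the pair's cluster), take the dot
    product of the difference vectors (di - dh) and (dj - dh), then average."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def sub(a, b):
        return [x - y for x, y in zip(a, b)]
    return sum(dot(sub(di, dh), sub(dj, dh)) for dh in viewpoints) / len(viewpoints)

# Two nearby documents look similar when viewed from a distant viewpoint.
score = mv_similarity((1.0, 0.0), (0.9, 0.1), [(0.0, 1.0)])  # 1.8
```

By contrast, plain cosine similarity fixes the single viewpoint at the origin, which is the limitation the paper sets out to address.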
This document compares the k-means and grid density clustering algorithms. K-means partitions data into k clusters based on minimizing distances between points and cluster centroids. It works well with numerical data but can be affected by outliers. Grid density determines dense grids based on neighbor densities and can handle different shaped and multi-density clusters without knowing the number of clusters beforehand. It has advantages over k-means in that it can handle categorical data, noise and arbitrary shaped clusters.
This document summarizes an article that proposes using genetic algorithms to improve the effectiveness of a text to matrix generator. It begins with an overview of information retrieval and discusses the vector space model and genetic algorithms. It then proposes a genetic approach to optimize the objective function of a text to matrix generator to increase the average number of terms. The goal is to retrieve more relevant documents by obtaining the best combination of terms from document collections using genetic algorithms. Experimental results are presented to validate that the genetic approach improves the performance of the text to matrix generator.
1) The document discusses different clustering algorithms for text summarization including hierarchical clustering, query-based summarization, graph theoretic clustering, fuzzy c-means clustering, and DBSCAN clustering.
2) These algorithms are evaluated based on performance parameters like precision, recall, time complexity, space complexity, and summary quality.
3) The algorithm found to perform best based on these evaluations will be suggested as the better algorithm for query-dependent text document summarization.
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
Subgraph relative frequency approach for extracting interesting substructures (IAEME Publication)
The document discusses a Subgraph Relative Frequency (SRF) approach for extracting interesting substructures from molecular data. SRF screens each frequent subgraph to determine if it is interesting based on its relative frequency. This is more efficient than the MISMOC approach, which requires calculating absolute frequencies. SRF was tested on a small molecular data set and found to perform satisfactorily and efficiently for classifying unknown molecules based on their subgraphs. The performance of SRF was comparable to MISMOC but with less computational complexity by using relative rather than absolute frequencies.
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITY (IJDKP)
This document summarizes an approach to improve source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs. A similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show the retrieval model using this measure improves retrieval performance over other models by up to 90.9% relative to the number of retrieved methods.
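As an illustrative reading of that measure (the helper names and the statement-type heuristic below are assumptions, not the paper's definitions):

```python
def sequence_similarity(query, method):
    """Score a retrieved method against a query sequence of statements.
    Statements matching verbatim count as full matches; statements sharing
    only the leading statement type (e.g. 'if', 'for') count as partial
    matches. The score is the fraction of matches that are full matches."""
    full = sum(1 for s in query if s in method)
    kinds = {s.split()[0] for s in method}          # statement types present
    partial = sum(1 for s in query
                  if s not in method and s.split()[0] in kinds)
    total = full + partial
    return full / total if total else 0.0

score = sequence_similarity(
    query=["if x", "for y", "while z"],
    method=["if x", "for k"])   # 1 full, 1 partial → 0.5
```

A retrieval model would rank candidate methods by this score, favouring those whose control-statement sequences match the query structurally, not just lexically.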
A Soft Set-based Co-occurrence for Clustering Web User Transactions (TELKOMNIKA JOURNAL)
This document proposes a soft set-based approach for clustering web user transactions to achieve lower computational complexity and higher clustering purity compared to previous rough set approaches. Unlike rough set approaches that use similarity, the proposed approach uses a co-occurrence approach based on soft set theory. The soft set representation of web user transactions allows modeling as a binary-valued information system. The approach is evaluated in comparison to two previous rough set-based approaches, demonstrating better performance with over 100% lower computational complexity and higher cluster purity.
Multi Label Spatial Semi Supervised Classification using Spatial Associative ... (cscpconf)
Multi-label spatial classification based on association rules with multi-objective genetic algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal with the multiple-class-labels problem. We adapt problem transformation for multi-label classification and use a hybrid evolutionary algorithm to optimize the generation of spatial association rules, which addresses single labels. MOGA is used to combine the single labels into multi-labels under the conflicting objectives of predictive accuracy and comprehensibility. Semi-supervised learning is done through the process of rule cover clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm is simulated and the results are compared with the existing MOGA-based associative classifier, which the proposed algorithm outperforms.
An H-K clustering algorithm for high dimensional data using ensemble learning (ijitcs)
The document summarizes a proposed clustering algorithm for high dimensional data that combines hierarchical (H-K) clustering, subspace clustering, and ensemble clustering. It begins with background on challenges of clustering high dimensional data and related work applying dimension reduction, subspace clustering, ensemble clustering, and H-K clustering individually. The proposed model first applies subspace clustering to identify clusters within subsets of features. It then performs H-K clustering on each subspace cluster. Finally, it applies ensemble clustering techniques to integrate the results into a single clustering. The goal is to leverage each technique's strengths to improve clustering performance for high dimensional data compared to using a single approach.
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MINING (IJDKP)
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
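The Lossy-Counting step can be illustrated with the textbook form of the algorithm (assumed here; the paper may use a variant). The stream is processed in buckets of width 1/ε, and entries that cannot possibly be frequent are pruned at each bucket boundary:

```python
def lossy_count(stream, epsilon):
    """Textbook Lossy Counting: maintain (count, delta) per item, where
    delta is the maximum undercount (the bucket index minus one at the time
    of insertion). Prune items whose count + delta falls at or below the
    current bucket index."""
    width = int(1 / epsilon)
    counts, deltas = {}, {}
    for i, item in enumerate(stream, 1):
        bucket = (i - 1) // width + 1
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1
        if i % width == 0:  # end of bucket: prune infrequent entries
            for k in list(counts):
                if counts[k] + deltas[k] <= bucket:
                    del counts[k], deltas[k]
    return counts

survivors = lossy_count(["a", "a", "b", "a", "c", "a"], epsilon=0.5)
```

Only items whose true frequency exceeds εN are guaranteed to survive, which is what makes the pass over each individual data source cheap.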
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR... (cscpconf)
Data Mining is one of the most significant tools for discovering association patterns that are useful in many knowledge domains. Yet there are drawbacks in existing mining techniques. Three main weaknesses of current data-mining techniques are: 1) re-scanning of the entire database must be done whenever new attributes are added; 2) an association rule may be true at a certain granularity but fail at a smaller one, and vice versa; 3) current methods can only be used to find either frequent rules or infrequent rules, but not both at the same time. This research proposes a novel data schema and an algorithm that address the above weaknesses while improving the efficiency and effectiveness of data mining strategies. The crucial mechanisms in each step are clarified in this paper. Finally, the paper presents experimental results regarding the efficiency, scalability, and information loss of the proposed approach to demonstrate its advantages.
A spatial data model for moving object databases (ijdms)
Moving Object Databases will have a significant role in Geospatial Information Systems as they allow users
to model continuous movements of entities in the databases and perform spatio-temporal analysis. For
representing and querying moving objects, an algebra with a comprehensive framework of User Defined
Types together with a set of functions on those types is needed. Moreover, concerning real world
applications, moving objects move along constrained environments like transportation networks so that an
extra algebra for modeling networks is demanded, too. These algebras can be inserted in any data model if
their designs are based on available standards such as Open Geospatial Consortium that provides a
common model for existing DBMSs. In this paper, we focus on extending a spatial data model for
constrained moving objects. Static and moving geometries in our model are based on Open Geospatial
Consortium standards. We also extend Structured Query Language for retrieving, querying, and
manipulating spatio-temporal data related to moving objects as a simple and expressive query language.
Finally as a proof-of-concept, we implement a generator to generate data for moving objects constrained
by a transportation network. Such a generator primarily aims at traffic planning applications.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su... (ijtsrd)
Clustering large, sparse, large-scale data is an open research problem in data mining. Discovering significant information through clustering algorithms remains inadequate, as most of the data turns out to be non-actionable, and existing clustering techniques are not feasible for time-varying data in high-dimensional space. Subspace clustering addresses these problems through the incorporation of domain knowledge and parameter-sensitive prediction, and the sensitivity of the data is predicted through a thresholding mechanism. Usability and usefulness in 3D subspace clustering are important open issues, and determining the correct dimension is an inconsistent and challenging problem in subspace clustering. The solutions are highly beneficial to police departments and law-enforcement organisations for better understanding stock issues and for insights that enable them to track activities and predict likelihoods. In this paper, a centroid-based subspace forecasting framework with constraints (must-link and must-not-link) informed by domain knowledge is proposed. In the unsupervised subspace clustering algorithm, inconsistent constraints correlating to dimensions are resolved through singular value decomposition. Principal component analysis is used to estimate the strength of actionable attributes, and domain knowledge is utilized to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitivity, and accuracy.
G. Raj Kamal, A. Deepika, D. Pavithra, J. Mohammed Nadeem, and V. Prasath Kumar, "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-4, June 2020. URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/31374/principle-component-analysis-based-on-optimal-centroid-selection-model-for-subspace-clustering-model/g-raj-kamal
A new model for iris data set classification based on linear support vector m... (IJECEIAES)
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification compared to 95.3% accuracy without parameter optimization.
The document proposes a strategy for clustering distributed databases using self-organizing maps (SOM) and K-means algorithms. The strategy applies SOM locally to each distributed data set to obtain representative subsets, then combines the results and applies SOM and K-means globally. Specifically, it performs local SOM clustering, sends representative data to a central site, applies SOM again on the combined data, then uses K-means on the unified map to produce the final clustering result.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the number of available research documents impels researchers to propose topic models to extract the latent semantic themes of a documents collection. However, extracting the hidden topics of a documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the documents collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has comparable performance, in terms of topic coherence, with LDA implemented in the MapReduce framework.
Evaluate the performance of K-Means and the fuzzy C-Means algorithms to forma... (IJECEIAES)
The clustering approach is considered a vital method for wireless sensor networks (WSNs): organizing the sensor nodes into clusters saves energy and prolongs the network lifetime, which depends entirely on the sensors' batteries and is a major challenge in WSNs. K-means (KM) and Fuzzy C-means (FCM) are two of the algorithms most used in the literature for this purpose. However, owing to the random manner in which nodes are deployed, these algorithms sometimes produce unbalanced clusters, which adversely affects the lifetime of the network. To the best of our knowledge, no study has analyzed the performance of these algorithms in terms of cluster construction in WSNs. In this study, we investigate the performance of KM and FCM, and which of them better constructs balanced clusters, to enable researchers to choose the appropriate algorithm for improving network lifespan. We utilize new parameters to evaluate the performance of cluster formation in multiple scenarios. Simulation results show that FCM is superior to KM, producing balanced clusters under random sensor-node distribution.
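One simple way such a cluster-balance parameter could be expressed, as an assumption for illustration rather than the paper's exact metric:

```python
def balance_index(cluster_sizes):
    """Ratio of the smallest to the largest cluster size: 1.0 means
    perfectly balanced clusters, values near 0 mean highly unbalanced
    ones (an illustrative metric, not necessarily the paper's)."""
    return min(cluster_sizes) / max(cluster_sizes)

print(balance_index([25, 25, 25, 25]))  # → 1.0   (balanced clusters)
print(balance_index([5, 40, 30, 25]))   # → 0.125 (unbalanced clusters)
```

Unbalanced clusters matter in WSNs because the cluster heads of oversized clusters drain their batteries faster, shortening the network lifetime.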
This document summarizes an article from the International Journal of Computer Engineering and Technology (IJCET). It discusses clustering algorithms for automatically organizing text documents into meaningful groups.
It begins by introducing the journal and describing document clustering as a technique to organize large amounts of text data. Then, it reviews four prominent clustering algorithms - cliques, single linkage, stars, and connected components - that use similarity metrics to group documents based on a term-by-document matrix.
The document analyzes the performance of these four algorithms on a document corpus, comparing their clustering results and processing times to determine documents' inherent groupings in an efficient manner.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN...acijjournal
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it is an elementary foundation for supervised learning, which encompasses classifier and feature-extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in scientific domains is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed instead. The Apache Hadoop distribution is one such cluster framework; it distributes voluminous data across a number of nodes. This paper focuses on the map/reduce design and implementation of the Apriori algorithm for structured data analysis.
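A single-machine sketch of the map/reduce formulation described above (illustrative only; the paper's actual implementation runs on Hadoop): the mapper emits a (candidate itemset, 1) pair for every k-subset of each transaction, and the reducer sums the pairs and filters by minimum support.

```python
from collections import Counter
from itertools import combinations

def map_phase(transactions, k):
    """Mapper: emit (candidate k-itemset, 1) for every k-subset of a transaction."""
    for t in transactions:
        for cand in combinations(sorted(t), k):
            yield cand, 1

def reduce_phase(pairs, min_support):
    """Reducer: sum the counts per itemset and keep the frequent ones."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {i: c for i, c in counts.items() if c >= min_support}

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
frequent_2 = reduce_phase(map_phase(transactions, 2), min_support=2)
print(frequent_2)  # {('bread', 'milk'): 2, ('bread', 'butter'): 2}
```

In a real Hadoop job the map and reduce phases run on different nodes and the framework handles the shuffle of pairs between them; the logic per phase is the same.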
Multi-threaded approach in generating frequent itemset of Apriori algorithm b...TELKOMNIKA JOURNAL
This research is about the application of multi-threaded and trie data structures to the support calculation problem in the Apriori algorithm.
The support counts are then used to mine association rules for market basket analysis. The support calculation is a bottleneck that can delay subsequent processing. This work examined five multi-threaded models based on Flynn's taxonomy: single program, multiple data (SPMD); multiple program, single data (MPSD); multiple program, multiple data (MPMD); and two double-SPMD variants, with the aim of shortening the processing time of the support calculation. In addition to the processing time, this work also considers the time difference between the multi-threaded models as the number of item variants increases. The experimental timings show that the multi-threaded model applying a double-SPMD structure performs almost three times faster than the models applying the SPMD structure, the MPMD structure, and the combination of MPMD and SPMD, based on the time differences in the 5-itemset and 10-itemset experiments.
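The SPMD idea above, where every worker runs the same support-counting code on a different slice of the transactions, can be sketched with Python threads (a toy illustration of the structure, not the paper's implementation, which also uses a trie):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk, candidates):
    """Every worker runs the same counting code on its own slice (SPMD style)."""
    c = Counter()
    for t in chunk:
        for cand in candidates:
            if cand <= t:          # candidate itemset is a subset of the transaction
                c[cand] += 1
    return c

def parallel_support(transactions, candidates, workers=4):
    size = max(1, len(transactions) // workers)
    chunks = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks, [candidates] * len(chunks)):
            total += partial       # merge the per-thread counts
    return total

txns = [frozenset(t) for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"])]
cands = [frozenset(["a", "b"]), frozenset(["b", "c"])]
print(parallel_support(txns, cands)[frozenset(["a", "b"])])  # 2
```

Note that in CPython the GIL limits the actual speedup of pure-Python counting; the sketch shows the SPMD decomposition rather than the performance gain.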
This document compares the k-means and grid density clustering algorithms. K-means partitions data into k clusters based on minimizing distances between points and cluster centroids. It works well with numerical data but can be affected by outliers. Grid density determines dense grids based on neighbor densities and can handle different shaped and multi-density clusters without knowing the number of clusters beforehand. It has advantages over k-means in that it can handle categorical data, noise and arbitrary shaped clusters.
This document summarizes an article that proposes using genetic algorithms to improve the effectiveness of a text to matrix generator. It begins with an overview of information retrieval and discusses the vector space model and genetic algorithms. It then proposes a genetic approach to optimize the objective function of a text to matrix generator to increase the average number of terms. The goal is to retrieve more relevant documents by obtaining the best combination of terms from document collections using genetic algorithms. Experimental results are presented to validate that the genetic approach improves the performance of the text to matrix generator.
1) The document discusses different clustering algorithms for text summarization including hierarchical clustering, query-based summarization, graph theoretic clustering, fuzzy c-means clustering, and DBSCAN clustering.
2) These algorithms are evaluated based on performance parameters like precision, recall, time complexity, space complexity, and summary quality.
3) The algorithm found to perform best based on these evaluations will be suggested as the better algorithm for query-dependent text document summarization.
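For reference, the precision, recall, and F-measure named in point 2 are computed from true positives (tp), false positives (fp), and false negatives (fn); a minimal sketch of the standard definitions:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard evaluation measures for comparing a system's selected items
    (e.g. summary sentences) against a reference set."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 8 correctly selected sentences, 2 spurious, 8 missed:
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f)  # precision 0.8, recall 0.5, F1 ≈ 0.615
```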
This document discusses GCUBE indexing, which is a method for indexing and aggregating spatial/continuous values in a data warehouse. The key challenges addressed are defining and aggregating spatial/continuous values, and efficiently representing, indexing, updating and querying data that includes both categorical and continuous dimensions. The proposed GCUBE approach maps multi-dimensional data to a linear ordering using the Hilbert curve, and then constructs an index structure on the ordered data to enable efficient query processing. Empirical results show the GCUBE indexing offers significant performance advantages over alternative approaches.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
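The core counting step, determining how many data points fall inside one triangle of the polygon, can be sketched with the standard same-side sign test (an assumption for illustration; the paper may use a different point-in-triangle method):

```python
def sign(p1, p2, p3):
    # Cross product of (p1 - p3) and (p2 - p3): its sign tells which side
    # of the line p2-p3 the point p1 lies on.
    return (p1[0] - p3[0]) * (p2[1] - p3[1]) - (p2[0] - p3[0]) * (p1[1] - p3[1])

def point_in_triangle(p, a, b, c):
    """p is inside (or on the edge of) triangle abc when the three edge
    cross products do not have mixed signs."""
    d1, d2, d3 = sign(p, a, b), sign(p, b, c), sign(p, c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def triangle_density(points, tri):
    """Number of data points falling inside one triangle of the polygon."""
    a, b, c = tri
    return sum(point_in_triangle(p, a, b, c) for p in points)

pts = [(0.2, 0.2), (0.9, 0.9), (0.1, 0.05)]
print(triangle_density(pts, ((0, 0), (1, 0), (0, 1))))  # 2
```

In a TDCT-style pass, this density would be computed per triangle and compared against the density confidence threshold to decide cluster membership.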
Subgraph relative frequency approach for extracting interesting substructurIAEME Publication
The document discusses a Subgraph Relative Frequency (SRF) approach for extracting interesting substructures from molecular data. SRF screens each frequent subgraph to determine if it is interesting based on its relative frequency. This is more efficient than the MISMOC approach, which requires calculating absolute frequencies. SRF was tested on a small molecular data set and found to perform satisfactorily and efficiently for classifying unknown molecules based on their subgraphs. The performance of SRF was comparable to MISMOC but with less computational complexity by using relative rather than absolute frequencies.
SOURCE CODE RETRIEVAL USING SEQUENCE BASED SIMILARITYIJDKP
This document summarizes an approach to improve source code retrieval using structural information from source code. A lexical parser is developed to extract control statements and method identifiers from Java programs. A similarity measure is proposed that calculates the ratio of fully matching statements to partially matching statements in a sequence. Experiments show the retrieval model using this measure improves retrieval performance over other models by up to 90.9% relative to the number of retrieved methods.
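The flavour of such a sequence-based measure can be approximated with the standard library's difflib, which rewards long runs of matching statements (an analogue only; the paper's ratio of fully to partially matching statements is defined differently):

```python
from difflib import SequenceMatcher

def sequence_similarity(query_stmts, method_stmts):
    """Score a method's extracted statement sequence against the query's.
    SequenceMatcher.ratio() favours long contiguous runs of matching
    statements over the same statements scattered out of order."""
    return SequenceMatcher(None, query_stmts, method_stmts).ratio()

query = ["for", "if", "return"]
m1 = ["for", "if", "return"]   # fully matching sequence
m2 = ["if", "while", "for"]    # partial, out-of-order match
print(sequence_similarity(query, m1))  # 1.0
print(sequence_similarity(query, m2))  # < 1.0
```

Ranking retrieved methods by such a score is the retrieval step the summary describes; the lexical parser's job is producing the statement sequences fed into it.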
A Soft Set-based Co-occurrence for Clustering Web User TransactionsTELKOMNIKA JOURNAL
This document proposes a soft set-based approach for clustering web user transactions to achieve lower computational complexity and higher clustering purity compared to previous rough set approaches. Unlike rough set approaches that use similarity, the proposed approach uses a co-occurrence approach based on soft set theory. The soft set representation of web user transactions allows modeling as a binary-valued information system. The approach is evaluated in comparison to two previous rough set-based approaches, demonstrating better performance with over 100% lower computational complexity and higher cluster purity.
Multi Label Spatial Semi Supervised Classification using Spatial Associative ...cscpconf
Multi-label spatial classification based on association rules with multi-objective genetic algorithms (MOGA), enriched by semi-supervised learning, is proposed in this paper to deal with the multiple-class-labels problem. We adapt problem transformation for multi-label classification and use a hybrid evolutionary algorithm to optimize the generation of spatial association rules, which addresses single labels. MOGA is then used to combine the single labels into multi-labels under the conflicting objectives of predictive accuracy and comprehensibility. Semi-supervised learning is performed through rule-cover clustering. Finally, an associative classifier is built with a sorting mechanism. The algorithm is simulated and the results are compared with a MOGA-based associative classifier, which it outperforms.
An H-K clustering algorithm for high dimensional data using ensemble learningijitcs
The document summarizes a proposed clustering algorithm for high dimensional data that combines hierarchical (H-K) clustering, subspace clustering, and ensemble clustering. It begins with background on challenges of clustering high dimensional data and related work applying dimension reduction, subspace clustering, ensemble clustering, and H-K clustering individually. The proposed model first applies subspace clustering to identify clusters within subsets of features. It then performs H-K clustering on each subspace cluster. Finally, it applies ensemble clustering techniques to integrate the results into a single clustering. The goal is to leverage each technique's strengths to improve clustering performance for high dimensional data compared to using a single approach.
PATTERN GENERATION FOR COMPLEX DATA USING HYBRID MININGIJDKP
This document discusses a hybrid data mining approach called combined mining that can generate informative patterns from complex data sources. It proposes applying three techniques: 1) Using the Lossy-counting algorithm on individual data sources to obtain frequent itemsets, 2) Generating incremental pair and cluster patterns using a multi-feature approach, 3) Combining FP-growth and Bayesian Belief Network using a multi-method approach to generate classifiers. The approach is tested on two datasets to obtain more useful knowledge and the results are compared.
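Step 1's Lossy-counting algorithm keeps approximate frequency counts over a data stream in bounded memory by pruning low-count items at bucket boundaries; a minimal sketch, assuming the classic Manku-Motwani formulation:

```python
def lossy_counting(stream, epsilon):
    """Lossy Counting: approximate frequencies over a stream.
    Returned counts underestimate true counts by at most epsilon * N."""
    width = int(1 / epsilon)                  # bucket width
    counts, deltas = {}, {}
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = (n - 1) // width   # current bucket id minus 1
        if n % width == 0:                    # prune at each bucket boundary
            bucket = n // width
            for it in list(counts):
                if counts[it] + deltas[it] <= bucket:
                    del counts[it], deltas[it]
    return counts

stream = ["a", "b", "a", "c", "a", "d", "a", "e"]
print(lossy_counting(stream, epsilon=0.25))  # {'a': 4}: the rare items were pruned
```

Run over each individual data source, the surviving counts give the frequent itemsets that the later pair/cluster pattern and classifier stages build on.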
DEVELOPING A NOVEL MULTIDIMENSIONAL MULTIGRANULARITY DATA MINING APPROACH FOR...cscpconf
Data mining is one of the most significant tools for discovering association patterns useful in many knowledge domains. Yet existing mining techniques have some drawbacks. Three main weaknesses of current data-mining techniques are: 1) the entire database must be re-scanned whenever new attributes are added; 2) an association rule may hold at a certain granularity but fail at a smaller one, and vice versa; 3) current methods can find either frequent rules or infrequent rules, but not both at the same time. This research proposes a novel data schema and an algorithm that address these weaknesses while improving the efficiency and effectiveness of data-mining strategies. The crucial mechanisms of each step are clarified in this paper. Finally, experimental results regarding the efficiency, scalability, and information loss of the proposed approach are presented to demonstrate its advantages.
A spatial data model for moving object databasesijdms
Moving Object Databases will have significant role in Geospatial Information Systems as they allow users
to model continuous movements of entities in the databases and perform spatio-temporal analysis. For
representing and querying moving objects, an algebra with a comprehensive framework of User Defined
Types together with a set of functions on those types is needed. Moreover, concerning real world
applications, moving objects move along constrained environments like transportation networks so that an
extra algebra for modeling networks is demanded, too. These algebras can be inserted in any data model if
their designs are based on available standards such as Open Geospatial Consortium that provides a
common model for existing DBMS’s. In this paper, we focus on extending a spatial data model for
constrained moving objects. Static and moving geometries in our model are based on Open Geospatial
Consortium standards. We also extend Structured Query Language for retrieving, querying, and
manipulating spatio-temporal data related to moving objects as a simple and expressive query language.
Finally as a proof-of-concept, we implement a generator to generate data for moving objects constrained
by a transportation network. Such a generator primarily aims at traffic planning applications.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...ijtsrd
Clustering large, sparse, large-scale data is an open research problem in data mining. Discovering significant information through clustering algorithms is often inadequate because much of the data is non-actionable, and existing clustering techniques are not feasible for time-varying data in high-dimensional space. Subspace clustering addresses these problems by incorporating domain knowledge and parameter-sensitive prediction, with the sensitiveness of the data predicted through a thresholding mechanism. Usability and usefulness in 3D subspace clustering are important open issues, and determining the correct dimensionality remains inconsistent and challenging. The resulting solutions can help police departments and law-enforcement organisations better understand stock issues, providing insights that enable them to track activities and predict likelihoods. In this paper, we propose a Centroid-based Subspace Forecasting Framework constrained by must-link and must-not-link domain knowledge. In the unsupervised subspace clustering algorithm, inconsistent constraints correlating to dimensions are resolved through singular value decomposition. Principal component analysis is used to estimate the actionability of particular attributes, and domain knowledge is used to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitiveness, and accuracy.
G. Raj Kamal, A. Deepika, D. Pavithra, J. Mohammed Nadeem, and V. Prasath Kumar, "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model", International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN 2456-6470, Volume 4, Issue 4, June 2020. Paper URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf, https://www.ijtsrd.com/computer-science/data-miining/31374/principle-component-analysis-based-on-optimal-centroid-selection-model-for-subspace-clustering-model/g-raj-kamal
A new model for iris data set classification based on linear support vector m...IJECEIAES
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification compared to 95.3% accuracy without parameter optimization.
The document proposes a strategy for clustering distributed databases using self-organizing maps (SOM) and K-means algorithms. The strategy applies SOM locally to each distributed data set to obtain representative subsets, then combines the results and applies SOM and K-means globally. Specifically, it performs local SOM clustering, sends representative data to a central site, applies SOM again on the combined data, then uses K-means on the unified map to produce the final clustering result.
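The local-then-global structure of this strategy can be sketched in a few lines; in the toy below, plain k-means stands in for the SOM stages (a deliberate simplification: the paper uses SOM locally and SOM followed by K-means globally, and the 1-D data is invented):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on 1-D points (pure-Python toy)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

# Stage 1: cluster each distributed site locally and keep only the
# cluster centres as representative data sent to the central site.
site1 = [1.0, 1.2, 0.8, 5.0, 5.2]
site2 = [0.9, 1.1, 5.1, 4.9, 5.3]
reps = kmeans(site1, 2) + kmeans(site2, 2)

# Stage 2: global clustering on the combined representatives.
final_centers = sorted(kmeans(reps, 2, seed=1))
print(final_centers)  # ≈ [1.0, 5.1]
```

The point of the two stages is that only the small set of representatives, not the raw data, crosses the network, which is what makes the strategy attractive for distributed databases.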
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
Extended pso algorithm for improvement problems k means clustering algorithmIJMIT JOURNAL
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together, so that instances within a cluster are as similar to each other as possible while differing as much as possible from instances in other clusters. In this paper we focus on partitional k-means clustering which, thanks to its ease of implementation and high-speed performance on large data sets, is still very popular thirty years after its development. To address the problem of k-means becoming trapped in local optima, we propose an extended PSO algorithm named ECPSO. Our new algorithm is able to escape local optima and, with high probability, produces the problem's optimal answer. The results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: clustering precision and clustering quality.
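The global-best PSO mechanics that ECPSO builds on can be sketched for a toy one-dimensional objective with two minima (a generic PSO for illustration, not the ECPSO variant itself; the objective function is invented):

```python
import random

def pso(f, lo, hi, particles=40, iters=150, seed=42):
    """Minimal global-best PSO in one dimension: each particle is pulled toward
    its personal best and the swarm's global best, which is how PSO escapes the
    local optima that plain k-means refinement can get stuck in."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(particles)]
    vs = [0.0] * particles
    pbest = xs[:]
    gbest = min(xs, key=f)
    w, c1, c2 = 0.7, 1.5, 1.5          # inertia and attraction coefficients
    for _ in range(iters):
        for i in range(particles):
            vs[i] = (w * vs[i]
                     + c1 * rng.random() * (pbest[i] - xs[i])
                     + c2 * rng.random() * (gbest - xs[i]))
            xs[i] = min(hi, max(lo, xs[i] + vs[i]))
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
            if f(xs[i]) < f(gbest):
                gbest = xs[i]
    return gbest

# Objective with a local minimum near x = 1.5 and the global one near x = -1.6.
f = lambda x: x**4 - 5 * x**2 + x
best = pso(f, -3, 3)
print(best)  # near the global minimum around -1.6
```

In a k-means hybrid such as ECPSO, the particle position would encode a set of cluster centroids and f would be the clustering objective (e.g. sum of squared errors) rather than a 1-D function.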
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
This document discusses using particle swarm optimization to improve the k-prototype clustering algorithm. The k-prototype algorithm clusters data with both numeric and categorical attributes but can get stuck in local optima. The proposed method uses particle swarm optimization, a global optimization technique, to guide the k-prototype algorithm towards better clusterings. Particle swarm optimization models potential solutions as particles that explore the search space. It is integrated with k-prototype clustering to avoid locally optimal solutions and produce better clusterings. The method is tested on standard benchmark datasets and shown to outperform traditional k-modes and k-prototype clustering algorithms.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach, and it improves text classification performance when combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayes (MNB). Combining the similarity between a test document and a category with the estimated value for that category enhances the performance of the classifier, yielding a text document categorization method that is both efficient and effective. In addition, a method for determining the proper relationship between the set of words in a document and its categorization is obtained.
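A minimal sketch of the cosine-similarity side of this approach, scoring a test document against per-category term profiles (the combination with MNB estimates is omitted, and the tiny corpus is invented for illustration):

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Category profile: summed term counts of its training documents."""
    c = Counter()
    for d in docs:
        c += vec(d)
    return c

train = {
    "sport": ["the team won the match", "a great goal in the match"],
    "tech":  ["the new chip runs fast", "a fast chip for the server"],
}
profiles = {cat: centroid(docs) for cat, docs in train.items()}

def classify(doc):
    """Assign the category whose profile is most cosine-similar to the document."""
    v = vec(doc)
    return max(profiles, key=lambda cat: cosine(v, profiles[cat]))

print(classify("the chip is fast"))  # tech
```

The hybrid the abstract describes would add the MNB posterior for each category to (or otherwise combine it with) this cosine score before taking the argmax.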
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important branch of data mining is text mining, which extracts high-quality information from text, typically through statistical pattern learning. High quality in text mining means a combination of relevance, novelty, and interestingness. Text mining tasks include text categorization, text clustering, entity extraction, and sentiment analysis. Natural language processing and analytical methods are highly preferred for turning free text into data that can be analysed.
This document discusses using particle swarm optimization (PSO) to design optimal close-range photogrammetry networks. PSO is introduced as a heuristic optimization algorithm inspired by bird flocking behavior that can be used to solve complex optimization problems. The document then provides an overview of close-range photogrammetry network design and the four design stages. It explains that PSO will be used to optimize the first stage of determining optimal camera station positions. Mathematical models of PSO for close-range photogrammetry network design are developed. Experimental tests are carried out to develop a PSO algorithm that can determine optimum camera positions and evaluate the accuracy of the developed network.
The document summarizes text mining techniques in data mining. It discusses common text mining tasks like text categorization, clustering, and entity extraction. It also reviews several text mining algorithms and techniques, including information extraction, clustering, classification, and information visualization. Several literature papers applying these techniques to domains like movie reviews, research proposals, and e-commerce are also summarized. The document concludes that text mining can extract useful patterns from unstructured text through techniques like clustering, classification, and information extraction.
This document summarizes an article that proposes improvements to an existing algorithm for resource scheduling in cloud computing environments. The existing algorithm uses a hybrid of ant colony optimization and particle swarm optimization. The proposed improvements add an initial phase that uses an enhanced fish swarm search algorithm to help find more global optimal solutions. This global optimal solution found by fish swarm search is then used to guide the existing ant colony optimization and particle swarm optimization hybrid to find more locally optimal solutions. The document provides background on resource scheduling, metaheuristic algorithms, and describes the specific implementations of the improved fish swarm search algorithm and the overall proposed methodology.
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
This document provides an overview and comparison of three classification algorithms: K-Nearest Neighbors (KNN), Decision Trees, and Bayesian Networks. It discusses each algorithm, including how KNN classifies data based on its k nearest neighbors. Decision Trees classify data based on a tree structure of decisions, and Bayesian Networks classify data based on probabilities of relationships between variables. The document conducts an analysis of these three algorithms to determine which has the best performance and lowest time complexity for classification tasks based on evaluating a mock dataset over 24 months.
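Of the three algorithms compared, KNN is the simplest to sketch: label a query point by majority vote among its k nearest labelled neighbours (the 2-D dataset below is invented for illustration):

```python
import math
from collections import Counter

def knn_classify(query, data, k=3):
    """k-nearest-neighbours: sort labelled points by Euclidean distance to
    the query, then take the majority label among the k closest."""
    nearest = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two labelled groups.
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), data))  # A
print(knn_classify((7, 8), data))  # B
```

The document's complexity comparison follows directly from this structure: naive KNN defers all work to query time (a full distance scan per query), whereas decision trees and Bayesian networks pay their cost up front during training.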
Hybrid multi objectives genetic algorithms and immigrants scheme for dynamic ...khalil IBRAHIM
This paper applies intelligent optimization techniques, artificial neural networks, and new genetic algorithms to the multi-objective multicast routing problem with the shortest path (SP) problem used in network addressing, improving addressing processes in wireless communications through multi-objective optimization. A key characteristic of mobile wireless networks is topology dynamics: because the network topology changes over time, the shortest-path routing problem (SPRP) in mobile ad hoc networks (MANETs) becomes a dynamic optimization problem [13]. The real-world problems addressed by the hybrid immigrants multi-objective genetic algorithm (HIMOGA) are dynamic in nature, with changing objective functions, constraints, and parameters; such dynamic optimization problems (DOPs) pose a major challenge to evolutionary multi-objective optimization, since any environmental change may affect the objective vector, constraints, and parameters. The optimization goal of HIMOGA is to track the moving parameters and obtain a sequence of approximate solutions over time. Quality of service (QoS) guarantees for all data traffic and maximal network utilization make QoS-based multicast routing a significant challenge, increasing the need for an efficient multicast routing protocol that can check multicast routes while satisfying the QoS constraints. The authors propose using HIMOGA and an SP algorithm to solve the multicast problem for new-generation wireless networks, with an immigrants scheme to obtain high-quality solutions after each change while satisfying all objectives.
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS (ijscmcj)
The purpose of this article is to determine the usefulness of graphics processing unit (GPU) computations for implementing the latent semantic indexing (LSI) reduction of the term-by-document matrix. The reduction considered is based on singular value decomposition (SVD). The high computational complexity of SVD, O(n^3), makes reducing a large indexing structure a difficult task. This article compares the time complexity and accuracy of the algorithms implemented in two different environments: the first based on the CPU and MATLAB R2011a, the second on graphics processors and the CULA library. The calculations were carried out on publicly available benchmark matrices, which were combined to obtain a resulting matrix of large size. For both environments, computations were performed on double- and single-precision data.
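The rank-k LSI reduction of a term-by-document matrix can be illustrated with a small NumPy sketch. The matrix below is a toy example, not one of the article's benchmark matrices:

```python
import numpy as np

# Toy term-by-document matrix: rows = terms, columns = documents.
A = np.array([
    [1., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 1., 0., 1.],
    [0., 0., 1., 1.],
])

# Full (thin) SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k LSI reduction: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_k is the best rank-k approximation of A in the Frobenius norm;
# the truncation error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))  # the two values coincide
```

Queries and documents are then compared in the k-dimensional latent space, which is what makes the O(n^3) cost of the decomposition the dominant expense for large collections.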
Hybrid features selection method using random forest and meerkat clan algorithm (TELKOMNIKA JOURNAL)
In the majority of gene expression investigations, selecting relevant genes for sample classification is a frequent challenge: researchers attempt to identify the smallest feasible set of genes that still achieves good predictive performance. Many gene selection methods rely on univariate (gene-by-gene) relevance rankings and arbitrary thresholds for the number of genes selected, apply only to 2-class problems, and use ranking criteria unrelated to the classification algorithm. A modified random forest (MRF) algorithm based on the meerkat clan algorithm (MCA) is presented in this work.
MCA is a swarm intelligence algorithm, while random forest is one of the most significant decision-tree-based machine learning approaches; here, MCA is used to choose the features supplied to the RF algorithm. Feature selection is critical in information systems, databases, and other applications. The proposed algorithm was applied to three different databases, and the experimental results for accuracy and time demonstrated its superiority over the original algorithm.
Max stable set problem to find the initial centroids in clustering problem (nooriasukmaningtyas)
In this paper, we propose a new approach to document clustering using the K-means algorithm, which is sensitive to the random selection of the k cluster centroids in the initialization phase. To improve the quality of K-means clustering, we model the text document clustering problem as the max stable set problem (MSSP) and use a continuous Hopfield network to solve the MSSP and obtain the initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle: MSSP consists of finding the largest set of completely disconnected nodes in a graph, while in clustering all objects are divided into disjoint clusters. Simulation results demonstrate that K-means improved by MSSP (KM_MSSP) is efficient on large data sets, is much better optimized in terms of time, and provides better clustering quality than other methods.
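The sensitivity of K-means to its initial centroids can be seen in a minimal Lloyd-iteration sketch. The MSSP/Hopfield step itself is not reproduced here; the seed centroids below are simply hand-picked for the toy data to stand in for whatever initializer is used:

```python
def kmeans(points, centroids, iters=10):
    """Plain Lloyd iterations starting from caller-supplied centroids."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (9, 9)])
print(centroids)  # two centroids, one per natural group
```

A good initializer places the seeds in distinct natural groups, which is exactly the property the MSSP formulation is meant to provide.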
IRJET- Diverse Approaches for Document Clustering in Product Development Anal... (IRJET Journal)
This document discusses several approaches for clustering textual documents, including:
1. TF-IDF, word embedding, and K-means clustering are proposed to automatically classify and organize documents.
2. Previous work on document clustering is reviewed, including partition-based techniques like K-means and K-medoids, hierarchical clustering, and approaches using semantic features, PSO optimization, and multi-view clustering.
3. Challenges of clustering large document collections at scale are discussed, along with potential solutions using frameworks like Hadoop.
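The TF-IDF weighting mentioned in item 1 can be sketched as follows. This is a toy variant with a smoothed idf and whitespace tokenization, invented for illustration; real pipelines normalize vectors and use richer tokenization:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} map per document (tf * smoothed idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in docs for t in set(doc.split()))
    out = []
    for doc in docs:
        tf = Counter(doc.split())
        out.append({t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf})
    return out

docs = ["data mining text clustering",
        "text clustering with kmeans",
        "deep learning for images"]
vecs = tfidf(docs)
# "mining" appears in one document, "text" in two, so for the first
# document "mining" receives the higher weight.
```

K-means (or any of the partition-based techniques in item 2) is then run over these weight vectors, typically after length normalization so that Euclidean distance tracks cosine similarity.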
An intelligent auto-response short message service categorization model using... (IJECEIAES)
Short message service (SMS) is one of the quickest and easiest means of communication, used by businesses, government organizations, and banks to send short messages to large groups of people. Categorizing the messages in users' inboxes by type gives receivers a concise view. Earlier studies treated the problem at the binary level (ham or spam), which caused messages that were useful to the end user to be masked as spam. Later work extended this to multiple labels such as ham, spam, and others, which is still not sufficient to meet all the needs of end users. Hence, a multi-class SMS categorization is needed, based on the semantics (information) embedded in each message. This paper introduces an intelligent auto-response model that uses a semantic index and the multi-layer perceptron (MLP) algorithm to categorize SMS messages into 5 categories: ham, spam, info, transactions, and one-time passwords. In this approach, each SMS is classified into one of the predefined categories. The experiment was conducted on the "multi-class SMS dataset" of 7,398 messages spread across the 5 classes, and the accuracy obtained was 97%.
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications, such as information retrieval, image processing, and social network analytics, and helps the user understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Various researchers have analyzed different types of clustering algorithms. K-means is the most popular partitioning-based algorithm and gives good results thanks to accurate calculations on numerical data, but it works well for numerical data only. Big data is a combination of numerical and categorical data, and the K-prototype algorithm handles both by combining the distances calculated from the numeric and the categorical attributes. With the growth of data due to social networking websites, business transactions, scientific computation, and so on, there are vast collections of structured, semi-structured, and unstructured data, so K-prototype must be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node, with CPU execution time and speedup used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better at large scale.
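The core of K-prototype, combining a numeric and a categorical distance, can be sketched as a single mixed-distance function. This is an illustration of the general idea with invented field layouts; gamma weights the categorical mismatch term:

```python
def kprototype_distance(x, y, num_idx, cat_idx, gamma=1.0):
    """Mixed distance: squared Euclidean on numeric fields plus a
    gamma-weighted mismatch count on categorical fields."""
    num = sum((x[i] - y[i]) ** 2 for i in num_idx)
    cat = sum(1 for i in cat_idx if x[i] != y[i])
    return num + gamma * cat

a = (1.0, 2.0, "red", "small")
b = (2.0, 2.0, "blue", "small")
d = kprototype_distance(a, b, num_idx=[0, 1], cat_idx=[2, 3], gamma=0.5)
print(d)  # 1.0 numeric + 0.5 * 1 categorical mismatch = 1.5
```

In a MapReduce setting, the map phase computes these distances to assign each record to its nearest prototype, and the reduce phase recomputes prototypes (means for numeric fields, modes for categorical fields) per cluster.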
Job Scheduling on the Grid Environment using Max-Min Firefly AlgorithmEditor IJCATR
Grid computing indeed is the next generation of distributed systems and its goals is creating a powerful virtual, great, and
autonomous computer that is created using countless Heterogeneous resource with the purpose of sharing resources. Scheduling is one
of the main steps to exploit the capabilities of emerging computing systems such as the grid. Scheduling of the jobs in computational
grids due to Heterogeneous resources is known as an NP-Complete problem. Grid resources belong to different management domains
and each applies different management policies. Since the nature of the grid is Heterogeneous and dynamic, techniques used in
traditional systems cannot be applied to grid scheduling, therefore new methods must be found. This paper proposes a new algorithm
which combines the firefly algorithm with the Max-Min algorithm for scheduling of jobs on the grid. The firefly algorithm is a new
technique based on the swarm behavior that is inspired by social behavior of fireflies in nature. Fireflies move in the search space of
problem to find the optimal or near-optimal solutions. Minimization of the makespan and flowtime of completing jobs simultaneously
are the goals of this paper. Experiments and simulation results show that the proposed method has a better efficiency than other
compared algorithms.
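The two objectives named above, makespan and flowtime, can be computed for a candidate schedule as follows. This is a toy sketch under the assumption that jobs assigned to one machine run back to back in list order; the job and machine names are invented:

```python
def schedule_metrics(assignment, runtimes):
    """Return (makespan, flowtime) for a job -> machine assignment.

    makespan = time the last machine finishes;
    flowtime = sum of every job's completion time.
    """
    clock = {}          # running finish time per machine
    flowtime = 0.0
    for job, machine in assignment:
        clock[machine] = clock.get(machine, 0.0) + runtimes[job]
        flowtime += clock[machine]  # this job completes at the new clock
    makespan = max(clock.values())
    return makespan, flowtime

runtimes = {"j1": 3.0, "j2": 2.0, "j3": 4.0, "j4": 1.0}
assignment = [("j1", "m1"), ("j2", "m2"), ("j3", "m2"), ("j4", "m1")]
makespan, flowtime = schedule_metrics(assignment, runtimes)
print(makespan, flowtime)  # 6.0 15.0
```

A metaheuristic such as the firefly algorithm searches over assignments, scoring each candidate with a function like this (typically a weighted combination of the two objectives).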
Similar to Text documents clustering using modified multi-verse optimizer (20)
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw... (IJECEIAES)
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to precisely delineate tumor boundaries from magnetic resonance imaging (MRI) scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating the state-of-the-art DeepLabv3+ architecture with a ResNet18 backbone. The model is rigorously trained and evaluated, exhibiting remarkable performance metrics, including a global accuracy of 99.286%, a class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted IoU of 98.620%, and a boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of the proposed model. These findings underscore the model's competence in precise brain tumor localization and its potential to advance medical image analysis and healthcare outcomes. This research paves the way for future exploration and optimization of advanced CNN models in medical imaging, with emphasis on addressing false positives and resource efficiency.
Embedded machine learning-based road conditions and driving behavior monitoring (IJECEIAES)
Car accident rates have increased in recent years, resulting in losses of human lives, property, and other financial costs. An embedded machine learning-based system is developed to address this critical issue. The system can monitor road conditions, detect driving patterns, and identify aggressive driving behaviors. It is based on neural networks trained on a comprehensive dataset of driving events, driving styles, and road conditions; it effectively detects potential risks and helps mitigate the frequency and impact of accidents, with the primary goal of ensuring the safety of drivers and vehicles. Data collection covered three key road events (normal street and normal driving, speed bumps, and circular yellow speed bumps) and three aggressive driving actions (sudden start, sudden stop, and sudden entry). The gathered data is processed and analyzed by a machine learning system designed for devices with limited power and memory. The developed system achieved 91.9% accuracy, 93.6% precision, and 92% recall. The inference time on an Arduino Nano 33 BLE Sense with a 32-bit CPU running at 64 MHz is 34 ms, with a 2.6 kB peak RAM and 139.9 kB program flash footprint, making the system suitable for resource-constrained embedded devices.
Advanced control scheme of doubly fed induction generator for wind turbine us... (IJECEIAES)
This paper describes a speed control device for generating electrical energy on an electricity network, based on the doubly fed induction generator (DFIG) used in wind power conversion systems. First, a DFIG model is constructed. A control law is then formulated to govern the flow of energy between the stator of the DFIG and the grid using three types of controllers: proportional-integral (PI), sliding mode controller (SMC), and second-order sliding mode controller (SOSMC). Their results are compared in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations. MATLAB/Simulink was used to conduct the simulations for this study. Multiple simulations show very satisfactory results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
Neural network optimizer of proportional-integral-differential controller par... (IJECEIAES)
The wide application of the proportional-integral-differential (PID) regulator in industry requires constant improvement of the methods used to tune its parameters. This paper addresses the optimization of PID regulator parameters using neural network methods. A methodology for choosing the architecture (structure) of the neural network optimizer is proposed, consisting of determining the number of layers, the number of neurons in each layer, and the form and type of activation function. Training algorithms are developed based on minimizing the mismatch between the regulated value and the target value, and the back-propagation of gradients is used to select the optimal learning rate for the network's neurons. The neural network optimizer, a superstructure over the linear PID controller, improves the regulation accuracy from 0.23 to 0.09, thereby reducing power consumption from 65% to 53%. The experimental results suggest that the created neural superstructure may well become a prototype of an automatic voltage regulator (AVR)-type industrial controller for tuning PID controller parameters.
An improved modulation technique suitable for a three level flying capacitor ... (IJECEIAES)
This research paper introduces an innovative modulation technique for controlling a 3-level flying capacitor multilevel inverter (FCMLI), aiming to streamline the modulation process in contrast to conventional methods. The proposed simplified modulation technique paves the way for more straightforward and efficient control of multilevel inverters, enabling their widespread adoption and integration into modern power electronic systems. Through the amalgamation of sinusoidal pulse width modulation (SPWM) with a high-frequency square wave pulse, this control technique attains energy equilibrium across the coupling capacitor. The modulation scheme incorporates a simplified switching pattern and a decreased count of voltage references, thereby simplifying the control algorithm.
A review on features and methods of potential fishing zone (IJECEIAES)
This review focuses on the importance of identifying potential fishing zones in seawater for sustainable fishing practices. It examines features such as sea surface temperature (SST) and sea surface height (SSH), along with the classification methods used to classify the data. The study underscores the importance of examining potential fishing zones using advanced analytical techniques and thoroughly explores the methodologies employed by researchers, covering both past and current approaches, with a focus on data characteristics and the application of classification algorithms. The prediction of potential fishing zones relies significantly on the effectiveness of these algorithms: previous research has assessed the performance of models such as support vector machines (SVM), naive Bayes, and artificial neural networks (ANN), with SVM classifying fisheries test data at 97.6% accuracy against 94.2% for naive Bayes. Based on recent works in this area, several recommendations for future work are presented to further improve the performance of potential fishing zone models, which is important to the fisheries community.
Electrical signal interference minimization using appropriate core material f... (IJECEIAES)
As demand for smaller, quicker, and more powerful devices rises, Moore's law is strictly followed. The industry has worked hard to make small devices that boost productivity, with the goal of optimizing device density. Scientists are reducing interconnect delays to improve circuit performance, which led to three-dimensional integrated circuit (3D IC) concepts that stack active devices and create vertical connections to diminish latency and shorten interconnects. Electrical interference is a major concern in 3D integrated circuits, and researchers have developed and tested through-silicon vias (TSVs) and substrates to decrease electrical coupling. This study illustrates a novel noise-coupling reduction method using several electrical coupling models. A 22% drop in coupling from wave-carrying to victim TSVs introduces this new paradigm and improves system performance even at higher THz frequencies.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p... (IJECEIAES)
Climate change's impact on the planet has forced the United Nations and governments to promote green energy and electric transportation. The deployment of photovoltaic (PV) and electric vehicle (EV) systems gained strong momentum due to their numerous advantages over fossil fuel alternatives, advantages that go beyond sustainability to include financial support and stability. This paper introduces a hybrid PV-EV system to support industrial and commercial plants. It covers the theoretical framework of the proposed hybrid system, including the equations required to complete the cost analysis when PV and EV are present, and presents the proposed design diagram that sets the priorities and requirements of the system. The proposed approach lets plants improve their power stability, especially during power outages. The presented information helps researchers and plant owners complete the necessary analysis while promoting the deployment of clean energy. The results of a case study of a dairy milk farm support the theoretical work and highlight the benefits to existing plants. The short return on investment of the proposed approach supports the paper's novel contribution to sustainable electrical systems. In addition, the proposed system allows for an isolated power setup without the need for a transmission line, which enhances the safety of the electrical network.
Bibliometric analysis highlighting the role of women in addressing climate ch... (IJECEIAES)
Fossil fuel consumption has increased quickly, contributing to climate change that is evident in unusual flooding, droughts, and global warming. Over the past ten years, women's involvement in society has grown dramatically, and they have succeeded in playing a noticeable role in reducing climate change. A bibliometric analysis of data from the last ten years was carried out to examine the role of women in addressing climate change. The findings are discussed in relation to the sustainable development goals (SDGs), particularly SDG 7 and SDG 13, and consider the contributions made by women in various sectors while taking geographic dispersion into account. The bibliometric analysis delves into topics including women's leadership in environmental groups, their involvement in policymaking, their contributions to sustainable development projects, and the influence of gender diversity on attempts to mitigate climate change. The results highlight how women have influenced policies and actions related to climate change, point out areas of research deficiency, and offer recommendations on how to increase the role of women in addressing climate change and achieving sustainability. To achieve more successful results, this initiative aims to highlight the significance of gender equality and encourage inclusivity in climate change decision-making processes.
Voltage and frequency control of microgrid in presence of micro-turbine inter... (IJECEIAES)
Active and reactive load changes have a significant impact on voltage and frequency. In this paper, in order to stabilize the microgrid (MG) against load variations in islanding mode, the active and reactive power of all distributed generators (DGs), including energy storage (battery), diesel generator, and micro-turbine, are controlled. The micro-turbine generator is connected to the MG through a three-phase to three-phase matrix converter, and the droop control method is applied for controlling the voltage and frequency of the MG. In addition, a method is introduced for voltage and frequency control of micro-turbines in the transition from grid-connected mode to islanding mode. A novel switching strategy of the matrix converter is used to convert the high-frequency output voltage of the micro-turbine to the grid-side frequency of the utility system. Moreover, with this switching strategy, low-order harmonics in the output current and voltage are not produced, and consequently the size of the output filter is reduced. The suggested control strategy is load-independent and has no frequency conversion restrictions. The proposed approach for voltage and frequency regulation demonstrates exceptional performance and favorable response across various load alteration scenarios. The suggested strategy is examined in several scenarios in the MG test systems, and the simulation results are discussed.
Enhancing battery system identification: nonlinear autoregressive modeling fo... (IJECEIAES)
Precisely characterizing Li-ion batteries is essential for optimizing their performance, enhancing safety, and prolonging their lifespan across various applications, such as electric vehicles and renewable energy systems. This article introduces an innovative nonlinear methodology for system identification of a Li-ion battery, employing a nonlinear autoregressive with exogenous inputs (NARX) model. The proposed approach integrates the benefits of nonlinear modeling with the adaptability of the NARX structure, facilitating a more comprehensive representation of the intricate electrochemical processes within the battery. Experimental data collected from a Li-ion battery operating under diverse scenarios are employed to validate the effectiveness of the proposed methodology. The identified NARX model exhibits superior accuracy in predicting the battery's behavior compared to traditional linear models. This study underscores the importance of accounting for nonlinearities in battery modeling, providing insights into the intricate relationships between state-of-charge, voltage, and current under dynamic conditions.
Smart grid deployment: from a bibliometric analysis to a survey (IJECEIAES)
Smart grids are one of the last decades' innovations in electrical energy. They bring relevant advantages compared to the traditional grid and draw significant interest from the research community. Assessing the field's evolution is essential to propose guidelines for facing new and future smart grid challenges. In addition, knowing the main technologies involved in the deployment of smart grids (SGs) is important to highlight possible shortcomings that can be mitigated by developing new tools. This paper contributes to the research trends mentioned above by focusing on two objectives. First, a bibliometric analysis is presented to give an overview of the current research level on smart grid deployment. Second, a survey of the main technological approaches used for smart grid implementation and their contributions is presented. To that effect, we searched the Web of Science (WoS) and Scopus databases, obtaining 5,663 documents from WoS and 7,215 from Scopus on smart grid implementation or deployment. Given the extraction limitation of the Scopus database, 5,872 of the 7,215 documents were extracted using a multi-step process. These two datasets were analyzed using a bibliometric tool called bibliometrix. The main outputs are presented with some recommendations for future research.
Use of analytical hierarchy process for selecting and prioritizing islanding ... (IJECEIAES)
One of the problems associated with power systems is the islanding condition, which must be rapidly and properly detected to prevent any negative consequences for the system's protection, stability, and security. This paper offers a thorough overview of several islanding detection strategies, divided into two categories: classic approaches, including local and remote approaches, and modern techniques, including techniques based on signal processing and computational intelligence. Each approach is compared and assessed against several factors, including implementation costs, non-detected zones, declining power quality, and response times, using the analytical hierarchy process (AHP). When all criteria are compared together, the multi-criteria decision-making analysis yields overall weights of 24.7% for passive methods, 7.8% for active methods, 5.6% for hybrid methods, 14.5% for remote methods, 26.6% for signal processing-based methods, and 20.8% for computational intelligence-based methods. The total weights thus indicate that hybrid approaches are the least suitable choice, while signal processing-based methods are the most appropriate islanding detection methods to select and implement in a power system with respect to the aforementioned factors. The proposed hierarchy model is studied and examined using Expert Choice software.
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi... (IJECEIAES)
The power generated by photovoltaic (PV) systems is influenced by environmental factors; this variability hampers the control and utilization of the solar cells' peak output. In this study, a single-stage grid-connected PV system is designed to enhance power quality. The approach employs fuzzy logic in the direct power control (DPC) of a three-phase voltage source inverter (VSI), enabling seamless integration of the PV system connected to the grid. Additionally, a fuzzy logic-based maximum power point tracking (MPPT) controller is adopted, which outperforms traditional methods such as incremental conductance (INC) in enhancing solar cell efficiency and minimizing response time. Moreover, the inverter's real-time active and reactive power is directly managed to achieve a unity power factor (UPF). The system's performance is assessed through a MATLAB/Simulink implementation, showing marked improvement over conventional methods, particularly in steady-state and varying weather conditions. For solar irradiances of 500 and 1,000 W/m^2, the results show that the proposed method reduces the total harmonic distortion (THD) of the current injected into the grid by approximately 46% and 38%, respectively, compared to conventional methods. Furthermore, the simulation results are compared against IEEE standards to evaluate the system's grid compatibility.
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b... (IJECEIAES)
Photovoltaic systems have emerged as a promising energy resource that caters to the future needs of society, owing to their renewable, inexhaustible, and cost-free nature. The power output of these systems depends on solar cell radiation and temperature. In order to mitigate the dependence on atmospheric conditions and enhance power tracking, a conventional approach has been improved by integrating various methods. To optimize the generation of electricity from solar systems, the maximum power point tracking (MPPT) technique is employed. To overcome limitations such as steady-state voltage oscillations and improve the transient response, two traditional MPPT methods, the fuzzy logic controller (FLC) and perturb and observe (P&O), have been modified. This research paper aims to simulate and validate the step size of the proposed modified P&O and FLC techniques within the MPPT algorithm using MATLAB/Simulink for efficient power tracking in photovoltaic systems.
Adaptive synchronous sliding control for a robot manipulator based on neural ... (IJECEIAES)
Robot manipulators have become important equipment in production lines, medical fields, and transportation. Improving the quality of trajectory tracking for robot hands is always an attractive topic in the research community. It is a challenging problem because robot manipulators are complex nonlinear systems and are often subject to fluctuations in loads and external disturbances. This article proposes an adaptive synchronous sliding control scheme to improve the trajectory tracking performance of a robot manipulator. The proposed controller ensures that the joint positions track the desired trajectory, synchronizes the errors, and significantly reduces chattering. First, the synchronous tracking errors and synchronous sliding surfaces are presented. Second, the synchronous tracking error dynamics are determined. Third, a robust adaptive control law is designed: the unknown components of the model are estimated online by a neural network, and the parameters of the switching elements are selected by fuzzy logic. The algorithm ensures that the tracking and approximation errors are ultimately uniformly bounded (UUB). Finally, the effectiveness of the constructed algorithm is demonstrated through simulation and experimental results, which show that the proposed controller achieves small synchronous tracking errors with a significantly reduced chattering phenomenon.
Remote field-programmable gate array laboratory for signal acquisition and de... (IJECEIAES)
A remote laboratory utilizing field-programmable gate array (FPGA) technologies enhances students' learning experience in embedded system design, anywhere and anytime. Existing remote laboratories prioritize hardware access and visual feedback for observing board behavior after programming, neglecting the comprehensive debugging tools needed to resolve errors that require internal signal acquisition. This paper proposes a novel remote embedded-system design approach targeting FPGA technologies that is fully interactive via a web-based platform. The solution provides FPGA board access and debugging capabilities beyond the visual feedback offered by existing remote laboratories. A lab module is implemented that users can seamlessly incorporate into their FPGA designs; it minimizes hardware resource utilization while enabling the acquisition of a large number of data samples during experiments by adaptively compressing the signal prior to transmission. The results demonstrate an average compression ratio of 2.90 across three benchmark signals, indicating efficient signal acquisition and effective debugging and analysis, and allowing users to acquire more data samples than conventional methods. The proposed lab allows students to remotely test and debug their designs, bridging the gap between theory and practice in embedded system design.
Detecting and resolving feature envy through automated machine learning and m... (IJECEIAES)
Efficiently identifying and resolving code smells enhances software project quality. This paper presents a novel solution, utilizing automated machine learning (AutoML) techniques, to detect code smells and apply move method refactoring. By evaluating code metrics before and after refactoring, we assessed its impact on coupling, complexity, and cohesion. Key contributions of this research include a unique dataset for code smell classification and the development of models using AutoGluon for optimal performance. Furthermore, the study identifies the top 20 influential features in classifying feature envy, a well-known code smell, stemming from excessive reliance on external classes. We also explored how move method refactoring addresses feature envy, revealing reduced coupling and complexity, and improved cohesion, ultimately enhancing code quality. In summary, this research offers an empirical, data-driven approach, integrating AutoML and move method refactoring to optimize software project quality. Insights gained shed light on the benefits of refactoring on code quality and the significance of specific features in detecting feature envy. Future research can expand to explore additional refactoring techniques and a broader range of code metrics, advancing software engineering practices and standards.
Smart monitoring technique for solar cell systems using internet of things ba...IJECEIAES
Rapidly and remotely monitoring and receiving the solar cell systems status parameters, solar irradiance, temperature, and humidity, are critical issues in enhancement their efficiency. Hence, in the present article an improved smart prototype of internet of things (IoT) technique based on embedded system through NodeMCU ESP8266 (ESP-12E) was carried out experimentally. Three different regions at Egypt; Luxor, Cairo, and El-Beheira cities were chosen to study their solar irradiance profile, temperature, and humidity by the proposed IoT system. The monitoring data of solar irradiance, temperature, and humidity were live visualized directly by Ubidots through hypertext transfer protocol (HTTP) protocol. The measured solar power radiation in Luxor, Cairo, and El-Beheira ranged between 216-1000, 245-958, and 187-692 W/m 2 respectively during the solar day. The accuracy and rapidity of obtaining monitoring results using the proposed IoT system made it a strong candidate for application in monitoring solar cell systems. On the other hand, the obtained solar power radiation results of the three considered regions strongly candidate Luxor and Cairo as suitable places to build up a solar cells system station rather than El-Beheira.
An efficient security framework for intrusion detection and prevention in int...IJECEIAES
Over the past few years, the internet of things (IoT) has advanced to connect billions of smart devices to improve quality of life. However, anomalies or malicious intrusions pose several security loopholes, leading to performance degradation and threat to data security in IoT operations. Thereby, IoT security systems must keep an eye on and restrict unwanted events from occurring in the IoT network. Recently, various technical solutions based on machine learning (ML) models have been derived towards identifying and restricting unwanted events in IoT. However, most ML-based approaches are prone to miss-classification due to inappropriate feature selection. Additionally, most ML approaches applied to intrusion detection and prevention consider supervised learning, which requires a large amount of labeled data to be trained. Consequently, such complex datasets are impossible to source in a large network like IoT. To address this problem, this proposed study introduces an efficient learning mechanism to strengthen the IoT security aspects. The proposed algorithm incorporates supervised and unsupervised approaches to improve the learning models for intrusion detection and mitigation. Compared with the related works, the experimental outcome shows that the model performs well in a benchmark dataset. It accomplishes an improved detection accuracy of approximately 99.21%.
International Conference on NLP, Artificial Intelligence, Machine Learning an...gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
CHINA’S GEO-ECONOMIC OUTREACH IN CENTRAL ASIAN COUNTRIES AND FUTURE PROSPECTjpsjournal1
The rivalry between prominent international actors for dominance over Central Asia's hydrocarbon
reserves and the ancient silk trade route, along with China's diplomatic endeavours in the area, has been
referred to as the "New Great Game." This research centres on the power struggle, considering
geopolitical, geostrategic, and geoeconomic variables. Topics including trade, political hegemony, oil
politics, and conventional and nontraditional security are all explored and explained by the researcher.
Using Mackinder's Heartland, Spykman Rimland, and Hegemonic Stability theories, examines China's role
in Central Asia. This study adheres to the empirical epistemological method and has taken care of
objectivity. This study analyze primary and secondary research documents critically to elaborate role of
china’s geo economic outreach in central Asian countries and its future prospect. China is thriving in trade,
pipeline politics, and winning states, according to this study, thanks to important instruments like the
Shanghai Cooperation Organisation and the Belt and Road Economic Initiative. According to this study,
China is seeing significant success in commerce, pipeline politics, and gaining influence on other
governments. This success may be attributed to the effective utilisation of key tools such as the Shanghai
Cooperation Organisation and the Belt and Road Economic Initiative.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Text documents clustering using modified multi-verse optimizer

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 6, December 2020, pp. 6361–6369
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i6.pp6361-6369
Ammar Kamal Abasi1, Ahamad Tajudin Khader2, Mohammed Azmi Al-Betar3, Syibrah Naim4, Mohammed A. Awadallah5, Osama Ahmad Alomari6
1,2 School of Computer Sciences, Universiti Sains Malaysia (USM), Malaysia
3 Department of Information Technology, Al-Huson University College, Jordan
4 Technology Department, Endicott College of International Studies (ECIS), Woosong University, Korea
5 Department of Computer Science, Al-Aqsa University, Palestine
6 Department of Computer Engineering, Faculty of Engineering and Architecture, Istanbul Gelisim University, Turkey
Article Info
Article history:
Received Mar 29, 2020
Revised May 2, 2020
Accepted May 18, 2020
Keywords:
Multi-verse optimizer
Optimization
Swarm intelligence
Text document clustering
ABSTRACT
In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as the similarity measure. TDC is tackled by the division of the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, which is a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas in the search space and search deeply in each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adapts the convergence behaviour of MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significant results in comparison with three well-established methods.
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Ammar Kamal Abasi,
School of Computer Sciences,
Universiti Sains Malaysia (USM),
11800 Pulau Pinang, Malaysia.
Email: ammar_abasi@student.usm.my
1. INTRODUCTION
In the current digital era, massive amounts of online text documents inundate the web every day. Manipulating these text documents is important for improving the query results returned by search engines, unsupervised text organisation systems, text classification, text summarization, knowledge extraction, information retrieval, text mining and scientific document clustering [1]. Many approaches have been proposed for the unsupervised organisation of text documents.
Text document clustering (TDC) is an effective and efficient technique used by researchers in this domain [2]; this field of text mining enables the organisation of large amounts of textual data. It can be defined as an unsupervised automatic document clustering technique that utilises document similarity to divide documents into homogeneous clusters. In other words, text documents in the same cluster are similar, whereas those in different clusters are dissimilar [3]. Conventionally, clustering methods can be classified into two main groups: (i) partitional clustering and (ii) hierarchical clustering.
Journal homepage: http://ijece.iaescore.com
K-means and K-medoids are simple and easy-to-use methods that can be tailored to suit large-scale text document datasets. They are iterative clustering techniques initiated with a predefined number of cluster centroids. At each iteration, documents are distributed into clusters according to a similarity function, depending on the distance between each centroid and its closest documents. Then, each cluster centroid is iteratively updated according to the documents belonging to that cluster. This operation stops when all the documents have been moved into the right cluster, that is, when the cluster centroids stagnate. The main shortcoming of these methods is their convergence behaviour: they move in one direction within a single search space region and do not perform a wider scan of the whole search space. Therefore, they can easily become trapped in local optima due to the unknown shapes of search spaces. Given that K-means is a local search method and TDC is formulated as an optimization problem [4], optimization methods that can escape local optima can be utilized for TDC [5].
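The iterative centroid-update loop described above can be sketched as follows; this is a minimal illustrative implementation (the function name and NumPy usage are our own, not from the paper):

```python
import numpy as np

def kmeans(docs, k, iters=100, seed=0):
    """Minimal K-means sketch: docs is an (n, t) term-frequency matrix."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k randomly chosen documents.
    centroids = docs[rng.choice(len(docs), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each document to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster's documents.
        new = np.array([docs[labels == c].mean(axis=0) if (labels == c).any()
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):   # stagnated centroids: stop
            break
        centroids = new
    return labels, centroids
```

The single-region convergence behaviour criticised above is visible here: each centroid only ever moves towards the mean of its current members, so the final partition depends heavily on the initial random choice.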
The most successful algorithms recently utilized for TDC are metaheuristic-based algorithms [6]. The first type of metaheuristic algorithm is the evolutionary algorithm (EA), which is initiated with a group of provisional individuals called a population. Generation after generation, the population is evolved on the basis of three main operators: recombination for mixing individual features, mutation for diversifying the search and selection for applying the survival-of-the-fittest principle. The EA is stopped when no further evolution can be achieved. The main shortcoming of EAs is that although they can simultaneously navigate several areas in the search space, they cannot perform deep searching in each area to which they navigate. Consequently, EAs mostly suffer from premature convergence. EAs that have been successfully utilized for TDC include the genetic algorithm (GA) [7] and harmony search [8].
The second type of metaheuristic algorithm is trajectory-based algorithms (TAs); a single solution is used to launch such an algorithm. This solution is improved iteratively using neighbouring-move operators until a local optimal solution, one that lies in the same search space region, is reached [9]. While TAs can extensively search the initial solution's region and achieve local optima, they cannot simultaneously navigate numerous search space regions. The main TAs utilized for TDC are K-means and K-medoids. Other TAs used for TDC are self-organizing maps (SOM) [10] and β-hill climbing.
The last type of metaheuristic algorithm is swarm intelligence (SI); an SI algorithm is likewise initiated with a set of random solutions called a swarm. Iteration after iteration, the solutions in the swarm are reconstructed by attracting them towards the best solutions found so far [11]. SI-based algorithms can easily converge prematurely. Several SI-based algorithms have been utilized for TDC, such as particle swarm optimization (PSO) [12] and artificial bee colony [13].
The multi-verse optimizer (MVO) algorithm was recently proposed as a stochastic population-based algorithm [14] inspired by multi-verse theory. The big bang theory explains the origin of the universe as a massive explosion; according to this theory, the origin of everything in our universe requires one big bang. Multi-verse theory holds that more than one explosion (big bang) occurred, with each big bang creating a new and independent universe. This theory is modelled as an optimization algorithm with three concepts: the white hole, the black hole and the wormhole, which perform exploration, exploitation and local search, respectively. MVO has been utilized for a wide range of optimization problems, such as identifying the optimal parameters of PEMFC stacks [15], unmanned aerial vehicle path planning [16], clustering problems [17], feature selection [18], neural networks [19], and optimising SVM parameters [18].
This paper adapts the MVO algorithm for the TDC problem using Euclidean distance as the similarity measure. The adaptation includes modifying the convergence behaviour of MVO operators to deal with the
discrete, rather than continuous, optimization problem. The main advantage of the proposed method is that it
improves the quality of final outcomes for TDC problems. A comprehensive comparative study is conducted
on six text document benchmark datasets that have different numbers of clusters and documents. The quality of
the final results is analysed with a discussion using accuracy, precision, recall, F-measure, purity and entropy
criteria. The findings of the experimental analyses reveal that the proposed method performs competitively in
comparison with state-of-the-art algorithms.
The rest of this paper is organised as follows. Section 2 presents the structure of MVO for TDC. Section 3 discusses the experimental results of MVO. Section 4 gives the conclusion of this study and directions for future work.
Int J Elec & Comp Eng, Vol. 10, No. 6, December 2020 : 6361 – 6369
2. MULTI-VERSE OPTIMIZER FOR TEXT DOCUMENT CLUSTERING
This section describes the main components utilized for tackling TDC. The term 'components' denotes the adaptation elements applied in solving the TDC problem with the MVO algorithm, in sequence: (i) TDC pre-processing, (ii) solution representation and (iii) calculation of the objective function (evaluation of the solutions). Finally, the question of how MVO is adapted to TDC is addressed.
2.1. Text document clustering (TDC)
TDC aims to divide documents into clusters; each cluster has similar documents, whereas documents
in different clusters are dissimilar [20, 21].
Before any clustering algorithm is applied to TDC, the text needs some necessary preliminary steps (pre-processing); this step filters unnecessary data, such as special formatting, special characters and numbers, out of the text. Thereafter, the pre-processed text document terms are converted into numerical form for further processing. The main goal of this step is to improve the quality of the features and reduce the implementation complexity of the TDC algorithm [22]. Text pre-processing includes tokenization, stop word removal and stemming, which are discussed in detail in the succeeding subsections.
2.1.1. Tokenization Phase
In the tokenization phase, each document is broken down into a set of tokens (words), where a token is any sequence of characters separated by spaces. Each document is then formulated as word instance counts in a bag-of-words model [23]. Note that the word instance counts are filtered through removal of empty sequences and number formatting and through collapsing, among other tasks [24].
2.1.2. Stop word removal Phase
In the stop word removal phase, commonly repeated terms (e.g. 'a', 'an', 'the', 'who', 'be', 'about', 'again', 'any', 'against'), pronouns (e.g. 'she', 'he', 'it'), conjunctions (e.g. 'but', 'and', 'or') and similar words are removed because they have high frequencies and negatively affect the clustering process (i.e. they hinder the clustering algorithm). This process improves the clustering performance and reduces the number of processed words or terms [22].
2.1.3. Stemming Phase
Stemming is the process of reducing terms to their roots by removal of affixes (prefixes and suffixes) [25]. For example, the root of the word 'stemming' is 'stem'. In the English language, many terms may share the same root; for example, the words 'connects', 'connected' and 'connecting' all stem from the same root, 'connect' (see www.text-processing.com/demo/stem/). The stemming process attempts to improve the clustering by reducing the number of different terms that have similar grammatical properties and stem from a single root [26].
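The three pre-processing phases above can be sketched as a single pipeline. In this illustrative sketch the stop-word list is abbreviated to the examples given in the text, and the suffix-stripping stemmer is a deliberately crude stand-in for a real stemmer such as Porter's:

```python
import re

# Abbreviated stop-word list (real systems use a much longer one).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "he", "she", "it",
              "who", "be", "about", "again", "any", "against"}

def stem(token):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    # 1) Tokenize: lowercase, keep alphabetic runs (drops numbers/formatting).
    tokens = re.findall(r"[a-z]+", document.lower())
    # 2) Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3) Stem each remaining token to its root.
    return [stem(t) for t in tokens]
```

For instance, `preprocess("He connects and connected the connecting wires")` collapses the three 'connect' variants to the single root 'connect'.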
2.1.4. Solution representation
Each solution is represented as a vector x = (x1, x2, . . . , xd), where d is the number of documents. Figure 1 shows an example of a solution representation. In this example, five clusters contain twenty documents; for example, cluster two has three documents {4, 8, 9}, and document 10 (i.e., x10) is in cluster three.
Figure 1. Solution representation
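This representation can be encoded directly as an integer vector. The assignment below is illustrative: only the two facts stated above (cluster two holds documents {4, 8, 9}; document 10 is in cluster three) come from the example, and the remaining labels are hypothetical:

```python
# Solution vector x: x[i] is the cluster label of document i+1
# (20 documents, 5 clusters). Labels are hypothetical except that
# cluster 2 holds documents {4, 8, 9} and document 10 is in cluster 3.
x = [1, 3, 1, 2, 5, 4, 1, 2, 2, 3, 4, 5, 1, 3, 4, 5, 5, 3, 4, 1]

def cluster_members(x, c):
    """Return the (1-based) document ids assigned to cluster c."""
    return [i + 1 for i, label in enumerate(x) if label == c]
```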
2.1.5. Objective function
In this study, for each solution x, the objective function is calculated as the average distance of documents to the cluster centroid (ADDC), as shown in (1).
\min f(x) = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{1}{n_i} \sum_{j=1}^{n_i} D(C_i, d_j) \right) \qquad (1)
Text documents clustering using... (Ammar Kamal Abasi)
where D(C_i, d_j) is the distance between the centroid of cluster i and document j, n_i is the number of documents in cluster i, k is the number of clusters, and f(x) is the objective function (i.e. the average distance, which is to be minimized).
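Equation (1) can be sketched in code as follows, with centroids recomputed from the current assignment and Euclidean distance as D; this is an illustrative implementation, not the authors' code:

```python
import numpy as np

def addc(docs, labels, k):
    """Average distance of documents to their cluster centroid, equation (1).

    docs:   (n, t) document-term matrix
    labels: length-n array of cluster labels in {0, ..., k-1}
    """
    total = 0.0
    for c in range(k):
        members = docs[labels == c]
        if len(members) == 0:          # empty cluster contributes nothing
            continue
        centroid = members.mean(axis=0)
        # Mean Euclidean distance of this cluster's documents to its centroid.
        total += np.linalg.norm(members - centroid, axis=1).mean()
    return total / k
```

A lower ADDC value means tighter clusters, so MVOTDC and the compared algorithms all minimise this quantity.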
2.2. Multi-verse in optimization context
MVO [14] is inspired by multi-verse theory, according to which universes connect and might even collide with each other. MVO formulates the search using three main concepts: wormholes, black holes, and white holes. The inflation rate (corresponding to the objective function value in the optimization context) determines, probabilistically, which kind of hole a universe hosts: a high inflation rate increases the probability of a white hole existing, whereas a low inflation rate increases the probability of a black hole existing [14]. Regardless of a universe's inflation rate, wormholes move objects randomly towards the best universe [15, 27]. The black and white hole concepts in MVO are formulated for exploring search spaces, and the wormhole concept is formulated for exploiting them. Like other EAs, MVO is initiated with a population of individuals (universes). Thereafter, MVO improves these solutions until a stopping criterion is met. The conceptual model of MVO in [14] shows the movement of objects between universes via white/black hole tunnels. These tunnels are created between two universes on the basis of their inflation rates (i.e. one universe has a higher inflation rate than the other). Objects leave universes with high inflation rates through white holes and are received by universes with low inflation rates through black holes.
After a population of solutions is initiated, all solutions in MVO are sorted from high to low inflation rate. MVO then visits the solutions one by one, attracting each towards the best one, under the assumption that the visited solution hosts the black hole. As for the white holes, a roulette wheel mechanism is used to select one solution.
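The roulette-wheel step can be sketched as below. Since TDC here is a minimisation problem (lower ADDC corresponds to a better inflation-rate rank), we weight universes inversely to their objective value; this inverse weighting is one common choice and an assumption on our part, not a detail stated in the paper:

```python
import numpy as np

def roulette_select(fitnesses, rng):
    """Roulette-wheel selection of a white-hole universe index.

    For a minimisation objective we weight inversely to fitness, so
    better (lower-ADDC) universes are more likely to donate objects.
    The small epsilon guards against division by zero.
    """
    weights = 1.0 / (np.asarray(fitnesses, dtype=float) + 1e-12)
    probs = weights / weights.sum()
    return rng.choice(len(fitnesses), p=probs)
```

Over many draws, a universe with fitness 1.0 is selected roughly 100 times as often as one with fitness 100.0 under this weighting.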
2.3. Adapting MVO for TDC
After the pre-processing step, MVO is used to split documents into their parent clusters. In this study, the solution representation and the objective function formulated above are used. The steps of classical MVO in [14] are adopted for TDC with certain modifications related to the nature of the problem variables. Given that the clustering problem is discrete in nature [28] and MVO was originally proposed for continuous optimization problems, MVO must deal with discrete values of the decision variables of each TDC solution. During MVO execution, the generation function and the wormhole equation (2) are adjusted to produce feasible solutions as follows:
x_i^j =
\begin{cases}
\begin{cases}
x_j + \mathrm{TDR} \times \left( (ub_j - lb_j) \times r_4 + lb_j \right) & r_3 < 0.5 \\
x_j - \mathrm{TDR} \times \left( (ub_j - lb_j) \times r_4 + lb_j \right) & r_3 \ge 0.5
\end{cases} & r_2 < \mathrm{WEP} \\
x_i^j & r_2 \ge \mathrm{WEP}
\end{cases}
\qquad (2)
A general overview of MVO for TDC is provided via Figure 2, which visualises the procedural steps.
Figure 2. Process of MVOTDC
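One plausible way to realise the discrete adaptation of equation (2) is to apply the wormhole move in the continuous domain and then round and clip the result into a valid cluster index, with lb = 1 and ub = k. This rounding scheme is an illustrative assumption on our part, not necessarily the authors' exact mechanism:

```python
import numpy as np

def wormhole_update(x_ij, best_j, k, WEP, TDR, rng):
    """Discrete adaptation sketch of the MVO wormhole move, equation (2).

    x_ij:   current cluster label of one document (in 1..k)
    best_j: label of the same document in the best universe so far
    The continuous step is rounded and clipped back into {1, ..., k}.
    """
    r2, r3, r4 = rng.random(3)
    if r2 >= WEP:
        return x_ij                      # no wormhole: keep the label
    lb, ub = 1, k
    step = TDR * ((ub - lb) * r4 + lb)
    value = best_j + step if r3 < 0.5 else best_j - step
    # Round to the nearest cluster index and clip into the feasible range.
    return int(min(max(round(value), lb), ub))
```

With TDR shrinking over iterations (as in classical MVO), later updates land ever closer to the best universe's label, mirroring the exploitation behaviour described above.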
3. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed method, a set of designed experiments is conducted using six instances of standard datasets formulated for measuring the performance of text clustering techniques. Six evaluation measures are used, as conventionally done: precision, recall, F-measure, entropy, accuracy, and purity. For comparative evaluation, the results obtained in terms of these measures are compared with those obtained by three state-of-the-art algorithms (K-means clustering, GA and PSO) using the same objective function. The experiments are conducted using the MATLAB programming language. Thorough descriptions of the experimental results are given in the following subsections.
3.1. Standard datasets
Table 1 provides the characteristics of the six text document datasets used in this study: CSTR, 20Newsgroups and Classic4 are available at sites.labic.icmc.usp.br/text-collections, while tr12, tr41 and Wap are available at glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz.
Table 1. Text document dataset characteristics
ID Datasets Number of documents (d) Number of features or terms (t) Number of clusters (K)
DS1 CSTR 299 1725 4
DS2 20Newsgroups 300 2275 3
DS3 tr12 313 5329 8
DS4 tr41 878 6743 10
DS5 Wap 1560 7512 20
DS6 Classic4 2000 6500 4
3.2. Results and discussion
The results obtained by MVOTDC are summarised in Table 2, and the parameter settings used in the experiments are given in Table 3. The results are summarised in terms of precision, recall, F-measure, entropy, accuracy, and purity for the six datasets. The findings demonstrate the validity and effectiveness of the proposed MVOTDC in distributing text documents to the right clusters.
Experiments are also conducted to show the validity of the proposed method in comparison with three well-known methods: GA, K-means and PSO. Table 3 shows the parameter setting values for each compared algorithm. These parameter settings are used as suggested in [17].
A comparative analysis of K-means, GA, PSO and MVOTDC is provided in Table 2 in terms of precision, recall, F-measure, entropy, accuracy, and purity; the average values for each measure are recorded. The results obtained by the K-means clustering algorithm are worse than those obtained by the other algorithms for nearly all datasets. A possible justification is that K-means is a local search algorithm; therefore, it is highly likely to fall into local optima due to its inability to explore the problem search space effectively. Meanwhile, population-based metaheuristic algorithms, such as GA, PSO and MVOTDC, can explore different areas in the search space simultaneously and can consequently achieve better exploration properties.
Table 2 also shows that MVOTDC attains minimum entropy and maximum purity, precision, recall, F-measure and accuracy for five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The ability of the proposed MVOTDC algorithm to reach the right balance between exploitation and exploration during the search, backed by a powerful learning mechanism, strengthens its performance in achieving impressive outcomes in comparison with the other methods.
Table 2 provides the results of the F-measure for all compared methods, including MVOTDC. Notably, MVOTDC produces the best F-measure values for five datasets. Furthermore, GA, PSO and MVOTDC outperform K-means on all the datasets.
From a different perspective, Table 2 also shows the accuracy of all compared algorithms. In general, the results obtained by MVOTDC are better than those of the other methods. In fact, the results can change slightly from one dataset to another because clustering algorithms are normally highly sensitive to the dataset search space. This can be validated by the finding that MVOTDC obtains the best accuracy on five datasets and the second-best for DS5.
The purity measure of clusters is another external evaluation criterion. It measures the maximum class share for each cluster; in general, the closer the purity value is to 1, the better the clustering solution. Table 2 shows the results of the purity measure for all compared methods on all datasets. MVOTDC outperforms K-means, GA and PSO in five datasets (i.e. DS1, DS2, DS3, DS4, DS6). The proposed algorithm obtains a 21.5% improvement over K-means for DS1. For DS2, MVOTDC's purity values show improvements of 6.0%, 2.6% and 2.4% over those acquired by K-means, GA and PSO, respectively. Meanwhile, the obtained improvements are 15.4%, 9.3%, 5.7%, 19.7%, 4.7%, 2.9%, 8.0%, 4.2% and 4.9% for datasets DS3, DS4 and DS6. In summary, the results shown in Table 2 reveal that MVOTDC outperforms all compared algorithms in terms of cluster quality (i.e. F-measure and purity).
Entropy is another external measure used for evaluating and comparing the quality of clustering algorithms. The entropy value is zero only when all documents of a single class are placed in a single cluster; such a solution is considered ideal. The bigger the entropy value, the worse the clustering solution. Table 2 shows the entropy values obtained by all the compared algorithms on the different datasets. According to the results, MVOTDC yields low entropy values for most of the datasets, which means that it performs better than the other algorithms and offers the best clustering solutions. Notably, K-means produces the worst entropy measure for all datasets, whereas GA and PSO are again ranked between MVOTDC and K-means. The superior performance of MVOTDC is due to its explorative capability in the search space.
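The purity and entropy measures can be computed as follows, using their standard definitions (majority-class fraction per cluster, and the size-weighted class entropy per cluster with base-2 logarithms); we assume these match the paper's formulations:

```python
import numpy as np
from collections import Counter

def purity(labels, truth):
    """Fraction of documents belonging to the majority class of their cluster."""
    n = len(labels)
    total = 0
    for c in set(labels):
        classes = [truth[i] for i in range(n) if labels[i] == c]
        total += Counter(classes).most_common(1)[0][1]
    return total / n

def entropy(labels, truth):
    """Size-weighted average entropy of class distributions inside clusters."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        classes = [truth[i] for i in range(n) if labels[i] == c]
        size = len(classes)
        for count in Counter(classes).values():
            p = count / size
            h -= (size / n) * p * np.log2(p)
    return h
```

A perfect clustering yields purity 1.0 and entropy 0.0, which is why Table 2 reads purity upward and entropy downward.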
The objective function is determined by ADDC for all clustering algorithms, so that the distance between the documents in each cluster is minimized. Figure 3 depicts the convergence trends of GA, PSO and MVOTDC using ADDC values; the x-axis shows the iteration number, and the y-axis shows the ADDC value. Notably, the convergence rate of MVOTDC is fairly fast for all datasets except DS5.
Table 2. Results of the accuracy, precision, recall, F-measure, purity and entropy for the K-means, GA, PSO and MVOTDC algorithms over 30 independent runs for DS1, DS2, DS3, DS4, DS5 and DS6
Dataset  Measure      K-means [17]  GA [29]  PSO [12]  MVOTDC
DS1 Accuracy 0.3573 0.3398 0.4355 0.4593
Precision 0.4091 0.4416 0.5340 0.5710
Recall 0.3091 0.3417 0.4359 0.4829
F-Measure 0.3459 0.3886 0.4819 0.5243
Purity 0.3524 0.4050 0.4953 0.5684
Entropy 0.8201 0.7170 0.6199 0.5206
DS2 Accuracy 0.3180 0.3675 0.3498 0.4044
Precision 0.3121 0.4209 0.4134 0.4391
Recall 0.3099 0.3676 0.3496 0.3840
F-Measure 0.3406 0.3935 0.3803 0.4109
Purity 0.3741 0.4080 0.4096 0.4343
Entropy 0.8028 0.7546 0.7722 0.7120
DS3 Accuracy 0.2971 0.3676 0.4075 0.4485
Precision 0.3522 0.4128 0.4297 0.5075
Recall 0.2944 0.3549 0.4263 0.4398
F-Measure 0.3221 0.3826 0.4277 0.4705
Purity 0.3907 0.4512 0.4877 0.5448
Entropy 0.7137 0.6233 0.5719 0.5224
DS4 Accuracy 0.4125 0.4320 0.4870 0.4630
Precision 0.3944 0.4140 0.4505 0.4568
Recall 0.3812 0.4007 0.4496 0.4418
F-Measure 0.3876 0.4071 0.4497 0.4568
Purity 0.4107 0.5602 0.5789 0.6081
Entropy 0.5874 0.5469 0.5391 0.5355
DS5 Accuracy 0.5011 0.5316 0.5622 0.5291
Precision 0.4626 0.5313 0.5249 0.5213
Recall 0.4010 0.4705 0.4810 0.4496
F-Measure 0.4314 0.4997 0.5016 0.4831
Purity 0.4759 0.4916 0.6124 0.6069
Entropy 0.7043 0.6216 0.5765 0.6625
DS6 Accuracy 0.5858 0.6620 0.6363 0.7042
Precision 0.5698 0.6725 0.6603 0.6919
Recall 0.5259 0.6319 0.6163 0.6843
F-Measure 0.5471 0.6518 0.6377 0.6880
Purity 0.5938 0.6319 0.6242 0.6742
Entropy 0.5600 0.5780 0.5306 0.5112
Int J Elec & Comp Eng, Vol. 10, No. 6, December 2020 : 6361 – 6369
Table 3. Parametric values for different variants of TDC algorithms
Algorithm                    Parameter                     Value
All optimization algorithms  Population size               60
All optimization algorithms  Maximum number of iterations  1000
All optimization algorithms  Number of runs                30
Proposed method (MVOTDC)     WEP max                       1
Proposed method (MVOTDC)     WEP min                       0.2
Proposed method (MVOTDC)     p                             6
GA                           Crossover probability         0.80
GA                           Mutation probability          0.02
PSO                          Maximum inertia weight        0.9
PSO                          Minimum inertia weight        0.2
PSO                          C1                            2
PSO                          C2                            2
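The WEP and p parameters control MVO's wormhole mechanism over the run. Using the schedules from the original MVO paper [14] together with the values from Table 3 (a sketch under that assumption; the exact update used by MVOTDC is not reproduced in this excerpt), the wormhole existence probability (WEP) grows linearly from 0.2 to 1 while the travelling distance rate (TDR) shrinks with exploitation accuracy p = 6:

```python
def wep(t, T=1000, wep_min=0.2, wep_max=1.0):
    """Wormhole existence probability: increases linearly with iteration t,
    shifting the search from exploration toward exploitation."""
    return wep_min + t * (wep_max - wep_min) / T

def tdr(t, T=1000, p=6):
    """Travelling distance rate: decreases with iterations; larger p makes
    late-stage moves around the best solution smaller and more precise."""
    return 1 - t ** (1 / p) / T ** (1 / p)

print(wep(0), wep(1000))    # -> 0.2 1.0
print(round(tdr(1000), 4))  # -> 0.0
```

The net effect is that early iterations explore broadly, while late iterations concentrate increasingly tight moves around the best universe.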
[Figure 3 here: six panels, one per dataset (DS1–DS6), each plotting ADDC (y-axis) against the number of iterations, 0–1000 (x-axis), with convergence curves for GA, PSO and MVO.]
Figure 3. Convergence characteristics of GA, PSO and MVOTDC on datasets DS1, DS2, DS3, DS4, DS5 and DS6
It is worth emphasizing that the MVO framework underlying MVOTDC can also be adapted to other optimization problems, such as EEG signal denoising [30], gene selection [31], and power scheduling [32]. Despite its superiority over the competing algorithms, MVOTDC remains sensitive to the characteristics of the datasets, which makes its behavior on new datasets difficult to predict.
Text documents clustering using... (Ammar Kamal Abasi)
4. CONCLUSION AND FUTURE WORK
This paper adapts the multi-verse optimizer (MVO), a metaheuristic optimization algorithm, to solve the text document clustering (TDC) problem; the resulting method is called MVOTDC. This method introduces a new strategy of sharing information between solutions on the basis of an objective function and learns from the best solution rather than from all solutions. MVOTDC converges quickly because it maintains an appropriate balance between exploration and exploitation throughout each run.
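The operator details of MVOTDC are not reproduced in this excerpt. As a minimal sketch under our own assumptions (all names below are hypothetical), the continuous MVO operators can be mapped to discrete cluster assignments by treating a universe as a vector of cluster labels: white/black-hole exchange copies labels from fitness-roulette-selected universes, and the wormhole step copies labels from the best universe with probability WEP instead of making continuous jumps.

```python
import random

def mvo_discrete_step(universes, fitness, best, wep):
    """One discrete MVO-style generation (a sketch, not the paper's exact
    operators).  `universes` is a list of cluster-label lists, `fitness`
    holds objective values (lower is better), `best` is the best universe."""
    # inflation-rate-style weights: fitter universes donate more often
    weights = [1 / (1 + f) for f in fitness]
    total = sum(weights)
    new_universes = []
    for u in universes:
        child = list(u)
        for i in range(len(child)):
            # white/black hole: roulette-select a donor universe
            r = random.random() * total
            acc = 0.0
            for v, w in zip(universes, weights):
                acc += w
                if acc >= r:
                    child[i] = v[i]
                    break
            # wormhole: with probability WEP, learn the label of the best
            if random.random() < wep:
                child[i] = best[i]
        new_universes.append(child)
    return new_universes
```

With WEP close to 1 late in the run, every position tends toward the best universe's label, mirroring the exploitation phase of continuous MVO.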
The proposed MVOTDC is evaluated using six text document datasets with various sizes and complexities; the number of documents and clusters in each dataset is reported. The quality of the obtained results is assessed using six measures: accuracy, precision, recall, F-measure, purity, and entropy.
These measures are also used for a comparative evaluation in which three well-known clustering al-
gorithms are used: K-means, genetic algorithm (GA) and particle swarm optimisation (PSO). For all measures,
the results obtained by MVOTDC are significantly better than those produced by the three compared methods.
In terms of computational time, MVOTDC is slower than K-means and requires nearly the same computa-
tional time as GA and PSO. Therefore, MVOTDC can be considered an efficient clustering method for the text
clustering domain.
Given the successful outcomes of MVO for the TDC problem, MVOTDC can be applied to different types of clustering problems. MVO can also be further improved by adding or modifying its operators so that it can address other discrete optimisation problems, such as scheduling. In addition, datasets other than those used in this work can be examined in future studies. Finally, MVO can be hybridized with local search strategies to improve the initial solutions and strengthen its exploitation capability during the optimization process.
REFERENCES
[1] N. Saini, S. Saha, and P. Bhattacharyya, “Automatic scientific document clustering using self-organized
multi-objective differential evolution,” Cognitive Computation, pp. 1–23, 2018.
[2] I. Arın, M. K. Erpam, and Y. Saygın, “I-twec: Interactive clustering tool for twitter,” Expert Systems with
Applications, vol. 96, pp. 1–13, 2018.
[3] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri, “Link-
based multi-verse optimizer for text documents clustering,” Applied Soft Computing, vol. 87, 2020.
[4] W. Song, W. Ma, and Y. Qiao, “Particle swarm optimization algorithm with environmental factors for
clustering analysis,” Soft Computing, vol. 21, no. 2, pp. 283–293, 2017.
[5] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, M. A. Awadallah, and X.-S. Yang, “Variants of
the flower pollination algorithm: a review,” Nature-Inspired Algorithms and Applied Optimization,
pp. 91–118, 2018.
[6] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri,
“A text feature selection technique based on binary multi-verse optimizer for text clustering,” IEEE Jordan
International Joint Conference on Electrical Engineering and Information Technology, pp. 1–6, 2019.
[7] J.-H. Jiang, J.-H. Wang, X. Chu, and R.-Q. Yu, “Clustering data using a modified integer genetic algorithm
(IGA),” Analytica Chimica Acta, vol. 354, no. 1, pp. 263–274, 1997.
[8] R. Forsati, M. Mahdavi, M. Shamsfard, and M. R. Meybodi, “Efficient stochastic algorithms for document
clustering,” Information Sciences, vol. 220, pp. 269–291, 2013.
[9] O. A. Alomari, A. T. Khader, M. A. Al-Betar, and M. A. Awadallah, “A novel gene selection method
using modified mrmr and hybrid bat-inspired algorithm with β-hill climbing,” Applied Intelligence, vol.
48, no. 11, pp. 4429–4447, 2018.
[10] N. Saini, S. Saha, A. Harsh, and P. Bhattacharyya, “Sophisticated som based genetic operators in multi-
objective clustering framework,” Applied Intelligence, pp. 1–20, 2018.
[11] M. Mavrovouniotis, C. Li, and S. Yang, “A survey of swarm intelligence for dynamic optimization: Al-
gorithms and applications,” Swarm and Evolutionary Computation, vol. 33, pp. 1–17, 2017.
[12] T. Cura, “A particle swarm optimization approach to clustering,” Expert Systems with Applications,
vol. 39, no. 1, pp. 1582–1588, 2012.
[13] K. K. Bharti and P. K. Singh, “Chaotic gradient artificial bee colony for text clustering,” Soft Computing,
vol. 20, no. 3, pp. 1113–1126, 2016.
[14] S. Mirjalili, S. M. Mirjalili, and A. Hatamlou, “Multi-verse optimizer: a nature-inspired algorithm for
global optimization,” Neural Computing and Applications, vol. 27, no. 2, pp. 495–513, 2016.
[15] A. Fathy and H. Rezk, “Multi-verse optimizer for identifying the optimal parameters of pemfc model,”
Energy, vol. 143, pp. 634–644, 2018.
[16] P. Kumar, S. Garg, A. Singh, S. Batra, N. Kumar, and I. You, “Mvo-based two-dimensional path planning
scheme for providing quality of service in uav environment,” IEEE Internet of Things Journal, 2018.
[17] A. K. Abasi, A. T. Khader, M. A. Al-Betar, S. Naim, S. N. Makhadmeh, and Z. A. A. Alyasseri,
“A novel hybrid multi-verse optimizer with k-means for text documents clustering,” Neural Computing
and Applications, 2020.
[18] H. Faris, M. A. Hassonah, A.-Z. Ala’M, S. Mirjalili, and I. Aljarah, “A multi-verse optimizer approach
for feature selection and optimizing svm parameters based on a robust system architecture,” Neural Com-
puting and Applications, pp. 1–15, 2017.
[19] I. Benmessahel, K. Xie, and M. Chellal, “A new evolutionary neural networks based on intrusion detection
systems using multiverse optimization,” Applied Intelligence, pp. 1–13, 2017.
[20] H.-S. Park and C.-H. Jun, “A simple and fast algorithm for k-medoids clustering,” Expert systems with
applications, vol. 36, no. 2, pp. 3336–3341, 2009.
[21] A. Huang, “Similarity measures for text document clustering,” Proceedings of the sixth new zealand
computer science research student conference (NZCSRSC2008), pp. 49–56, 2008.
[22] A. I. Kadhim, Y.-N. Cheah, and N. H. Ahamed, “Text document preprocessing and dimension reduc-
tion techniques for text document clustering,” 4th International Conference on Artificial Intelligence with
Applications in Engineering and Technology (ICAIET), pp. 69–73, 2014.
[23] R. Zhao and K. Mao, “Fuzzy bag-of-words model for document representation,” IEEE Transactions on
Fuzzy Systems, vol. 26, no. 2, pp. 794–804, 2018.
[24] K. K. Bharti and P. K. Singh, “Opposition chaotic fitness mutation based adaptive inertia weight bpso for
feature selection in text clustering,” Applied Soft Computing, vol. 43, pp. 20–34, 2016.
[25] J. Singh and V. Gupta, “A systematic review of text stemming techniques,” Artificial Intelligence Review,
vol. 48, no. 2, pp. 157–217, 2017.
[26] M. N. P. Katariya, M. Chaudhari, B. Subhani, G. Laxminarayana, K. Matey, M. A. Nikose, S. A. Tinkhede,
and S. Deshpande, “Text preprocessing for text mining using side information,” International Journal of
Computer Science and Mobile Applications, vol. 3, no. 1, pp. 01–05, 2015.
[27] D. Janiga, R. Czarnota, J. Stopa, P. Wojnarowski, and P. Kosowski, “Performance of nature inspired
optimization algorithms for polymer enhanced oil recovery process,” Journal of Petroleum Science and
Engineering, vol. 154, pp. 354–366, 2017.
[28] W. Song, Y. Qiao, S. C. Park, and X. Qian, “A hybrid evolutionary computation approach with its ap-
plication for optimizing text document clustering,” Expert Systems with Applications, vol. 42, no. 5,
pp. 2517–2524, 2015.
[29] D. Mustafi and G. Sahoo, “A hybrid approach using genetic algorithm and the differential evolution
heuristic for enhanced initialization of the k-means algorithm with applications in text clustering,” Soft
Computing, pp. 1–18, 2018.
[30] Z. A. A. Alyasseri, A. T. Khader, M. A. Al-Betar, A. K. Abasi, and S. N. Makhadmeh, “EEG signals de-
noising using optimal wavelet transform hybridized with efficient metaheuristic methods,” IEEE Access,
vol. 8, pp. 10584–10605, 2019.
[31] M. A. Al-Betar, O. A. Alomari, and S. M. Abu-Romman, “A triz-inspired bat algorithm for gene selection
in cancer classification,” Genomics, vol. 112, no. 1, pp. 114–126, 2020.
[32] S. N. Makhadmeh, A. T. Khader, M. A. Al-Betar, S. Naim, A. K. Abasi, and Z. A. A. Alyasseri,
“Optimization methods for power scheduling problems in smart home: Survey,” Renewable and
Sustainable Energy Reviews, vol. 115, 2019.