In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter "preference" p that yields an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter "preference" p, i.e. it is modeled based on the similarity distribution. Given the SM and p, the Modified Adaptive AP (MAAP) procedure is run. The MAAP procedure omits the adaptive p-scanning algorithm used in the original Adaptive AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and the real datasets, the proposed algorithm succeeds in identifying clusters that match the number of true labels in the data, with execution times comparable to those of the original AP. In addition, the MAAP-DDP algorithm is more feasible and effective than the original AAP procedure.
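The MAAP-DDP preference model itself is specific to the paper, but the general idea of deriving p from the similarity distribution can be sketched with scikit-learn's AffinityPropagation, used here purely as a stand-in; the toy data and the median-of-similarities rule are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

# Toy data with 4 underlying groups.
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Similarity matrix: negative squared Euclidean distance (the form used by AP).
S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

# Derive the preference p from the similarity distribution.  The median of the
# off-diagonal similarities is the usual neutral choice; MAAP-DDP's own model
# is the paper's contribution, so this is only a distribution-driven stand-in.
off_diag = S[~np.eye(len(S), dtype=bool)]
p = np.median(off_diag)

ap = AffinityPropagation(affinity="precomputed", preference=p, random_state=0)
labels = ap.fit_predict(S)
print("clusters found:", len(ap.cluster_centers_indices_))
```

Lower (more negative) preferences produce fewer exemplars, which is why a distribution-driven choice of p matters.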
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...ijcseit
This document discusses various statistical analysis and feature engineering techniques that can be used for model building in machine learning algorithms. It describes how proper feature extraction through techniques like correlation analysis, principal component analysis, recursive feature elimination, and feature importance can help improve the accuracy of machine learning models. The document provides examples of applying different feature selection methods like univariate selection, recursive feature elimination, and principal component analysis on a diabetes dataset. It also explains the mathematics behind principal component analysis and how feature importance is estimated using an extra trees classifier. Overall, the document emphasizes how statistical analysis and feature engineering are important for effective model building in machine learning.
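As a concrete illustration of the techniques this summary lists, the following minimal sketch applies univariate selection, recursive feature elimination, and PCA with scikit-learn; the diabetes data shipped with scikit-learn is only a stand-in for the dataset used in the document, so the selected columns will not match its results:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

# Stand-in data: scikit-learn's diabetes set.
X, y = load_diabetes(return_X_y=True)

# 1. Univariate selection: keep the 4 features with the strongest F-score.
kbest = SelectKBest(score_func=f_regression, k=4).fit(X, y)
print("univariate picks:", np.where(kbest.get_support())[0])

# 2. Recursive feature elimination: drop the weakest feature each round.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4).fit(X, y)
print("RFE picks:", np.where(rfe.support_)[0])

# 3. PCA: how much variance the first 4 components explain.
pca = PCA(n_components=4).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```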
Opinion mining framework using proposed RB-bayes model for text classicationIJECEIAES
Data mining is a powerful concept with great potential to anticipate future patterns and behavior. It refers to the extraction of hidden information from vast data sets using techniques such as statistical analysis, machine learning, clustering, neural networks, and genetic algorithms. Naive Bayes suffers from the zero-likelihood problem. This paper proposes the RB-Bayes method, based on Bayes' theorem, for prediction without the zero-likelihood problem. We also compare our method with existing methods, namely naive Bayes and SVM, and demonstrate that it outperforms them and, in particular, analyzes data sets more effectively. When the proposed approach is tested on real data sets, the results show improved accuracy in most cases, with the RB-Bayes algorithm reaching an accuracy of 83.333%.
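The RB-Bayes formulation is the paper's own; the short sketch below only makes the zero-likelihood problem concrete and shows the standard Laplace (add-one) smoothing fix for plain naive Bayes, with a toy vocabulary chosen for illustration:

```python
from collections import Counter

def class_conditional(word, cls_docs, vocab_size, alpha=0.0):
    """P(word | class) with optional add-alpha (Laplace) smoothing."""
    counts = Counter(w for doc in cls_docs for w in doc)
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * vocab_size)

pos_docs = [["good", "fast"], ["good", "cheap"]]
vocab = {"good", "fast", "cheap", "slow"}

# Unsmoothed: a word never seen with the class gets probability 0, which zeroes
# out the whole product P(class) * prod P(w|class) -- the "zero likelihood" issue.
print(class_conditional("slow", pos_docs, len(vocab)))            # 0.0
# Laplace smoothing keeps every likelihood strictly positive.
print(class_conditional("slow", pos_docs, len(vocab), alpha=1.0)) # > 0
```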
A Combined Approach for Feature Subset Selection and Size Reduction for High ...IJERA Editor
Selection of relevant features from a given feature set is one of the important issues in data mining and classification. In general, a dataset may contain many features, but not all of them are necessary for a particular analysis or decision-making task: features may share common information or be entirely irrelevant to the processing at hand. This usually happens because of improper feature selection during dataset formation or incomplete information about the observed system. In either case, the data contain features that only increase the processing burden and may ultimately degrade the outcome of the analysis. Methods are therefore required to detect and remove such features. In this paper we present an efficient approach that not only removes unimportant features but also reduces the overall size of the dataset. The proposed algorithm uses information theory to compute the information gain of each feature and a minimum spanning tree to group similar features; fuzzy c-means clustering is then used to remove similar entries from the dataset. Finally, the algorithm is tested with an SVM classifier on 35 publicly available real-world high-dimensional datasets, and the results show that it not only reduces the feature set and the data length but also improves the classifier's performance.
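The full pipeline (information gain, minimum-spanning-tree grouping, fuzzy c-means deduplication) is the paper's own; the sketch below covers only the first stage under stated assumptions, ranking features by information gain (mutual information) with scikit-learn on a stand-in dataset and threshold:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

# Stand-in dataset; the paper's 35 benchmark sets are not reproduced here.
X, y = load_breast_cancer(return_X_y=True)

# Information gain of each feature with respect to the class label.
gain = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(gain)[::-1]
print("top 5 features by information gain:", ranking[:5])

# Features whose gain falls below a threshold would be dropped before the
# MST grouping and fuzzy c-means stages described in the abstract.
keep = ranking[gain[ranking] > 0.05]
print("kept", len(keep), "of", X.shape[1], "features")
```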
The document analyzes crop yield data from spatial locations in Guntur District, Andhra Pradesh, India using hybrid data mining techniques. It first applies k-means clustering to the dataset, producing 5 clusters. It then applies the J48 classification algorithm to the clustered data, resulting in a decision tree that predicts cluster membership based on attributes like crop type, irrigated area, and latitude. Analysis found irrigated areas of cotton and chilies increased from 2007-2008 to 2011-2012. Association rule mining on the clustered data also found relationships between productivity and location attributes. The hybrid approach of clustering followed by classification effectively analyzed the spatial agricultural data.
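The cluster-then-classify pattern described above can be sketched as follows; scikit-learn's CART-style DecisionTreeClassifier stands in for J48, and synthetic blobs replace the Guntur crop records, so this illustrates the workflow rather than reproducing the study:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the crop-yield records.
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)

# Step 1: unsupervised grouping into 5 clusters, as in the study.
clusters = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# Step 2: learn a decision tree that predicts cluster membership from the
# attributes (J48/C4.5 in the paper; CART here as an approximation).
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, clusters)
print(export_text(tree))
```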
Comparison of Cost Estimation Methods using Hybrid Artificial Intelligence on...IJERA Editor
Cost estimation at the schematic design stage, as the basis of project evaluation, engineering design, and cost management, plays an important role in project decisions made under a limited definition of scope, constraints on available information and time, and the presence of uncertainties. The purpose of this study is to compare the performance of cost estimation models built with two different hybrid artificial intelligence approaches: regression analysis with an adaptive neuro-fuzzy inference system (RANFIS) and case-based reasoning with a genetic algorithm (CBR-GA). The models were developed from the same 50 low-cost apartment project datasets in Indonesia. Tested on another five data points, the models proved to perform very well in terms of accuracy. The CBR-GA model was the best performer but had the disadvantage of needing 15 cost drivers, compared with only 4 cost drivers required by RANFIS for on-par performance.
This document compares hierarchical and non-hierarchical clustering algorithms. It summarizes four clustering algorithms: K-Means, K-Medoids, Farthest First Clustering (hierarchical algorithms), and DBSCAN (non-hierarchical algorithm). It describes the methodology of each algorithm and provides pseudocode. It also describes the datasets used to evaluate the performance of the algorithms and the evaluation metrics. The goal is to compare the performance of the clustering methods on different datasets.
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
In this paper we attempt to solve the automatic clustering problem by concurrently optimizing multiple objectives, namely automatic determination of k and a set of cluster validity indices (CVIs). The proposed automatic clustering technique uses the recent Jaya optimization algorithm as its underlying optimization strategy. This evolutionary technique aims to reach a global rather than a local best solution, even on larger datasets. The exploration and exploitation applied in the proposed work detect the number of clusters automatically, find an appropriate partitioning of the data sets, and obtain near-optimal values of the CVIs. Twelve datasets of varying complexity are used to validate the performance of the proposed algorithm. The experiments show that the theoretical advantages of multi-objective clustering optimized with evolutionary approaches translate into realistic and scalable performance gains.
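The AutoJAYA encoding and its multi-objective CVI fitness are specific to the paper; the sketch below only shows the underlying Jaya update rule, which moves each candidate toward the current best solution and away from the worst one without any algorithm-specific parameters, applied here to a toy objective:

```python
import numpy as np

def jaya(objective, bounds, pop_size=20, iters=200, seed=0):
    """Minimal Jaya optimizer: move each candidate toward the current best
    solution and away from the current worst one; no tuning parameters."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, len(lo)))
    for _ in range(iters):
        fit = np.array([objective(x) for x in pop])
        best, worst = pop[fit.argmin()], pop[fit.argmax()]
        r1, r2 = rng.random(pop.shape), rng.random(pop.shape)
        cand = np.clip(pop + r1 * (best - np.abs(pop))
                           - r2 * (worst - np.abs(pop)), lo, hi)
        cand_fit = np.array([objective(x) for x in cand])
        improved = cand_fit < fit          # keep a move only if it helps
        pop[improved] = cand[improved]
    fit = np.array([objective(x) for x in pop])
    return pop[fit.argmin()], fit.min()

# Toy objective: sphere function; AutoJAYA would instead score encoded
# clusterings with cluster validity indices.
sphere = lambda x: float(np.sum(x ** 2))
x_best, f_best = jaya(sphere, (np.full(5, -10.0), np.full(5, 10.0)))
print(x_best.round(3), f_best)
```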
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
The tremendous increase in the number of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from scalability problems as the size of the document collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach has comparable performance, in terms of topic coherence, with LDA implemented in the MapReduce framework.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...IJAEMSJORNAL
The current trend in the industry is to analyze large data sets and apply data mining and machine learning techniques to identify patterns. A key challenge with huge data sets, however, is their high dimensionality. In data analytics applications, large amounts of data sometimes produce worse performance. Also, most data mining algorithms are implemented column-wise, and too many columns restrict performance and slow processing down. Dimensionality reduction is therefore an important step in data analysis: it converts high-dimensional data into a much lower dimension such that the maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural network techniques for data reduction; each method has a different rationale for preserving the relationships between input parameters during analysis. Principal Component Analysis, a multivariate technique, and the Self-Organising Map, a neural network technique, are presented in this paper. A hierarchical clustering approach has also been applied to the reduced data set. A case study of air quality measurement is used to evaluate the performance of the proposed techniques.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Documents in the form of text, images, music, etc. can be classified using such categorization techniques. It provides conceptual views of the collected documents and has important applications in the real world. Text-based categorization is used for document classification together with pattern recognition and machine learning. The advantages of a number of classification algorithms, such as the Naive Bayes algorithm, K-Nearest Neighbor, and Decision Tree, have been studied in this paper for classifying documents. The paper presents a comparative study of the advantages and disadvantages of the above-mentioned classification algorithms.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
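A minimal sketch of the comparison this abstract describes, using scikit-learn pipelines for the three classifiers; the two-category 20 newsgroups subset is an assumed stand-in for the paper's document collection:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Small two-category corpus as a stand-in for the paper's documents.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("Decision Tree", DecisionTreeClassifier(max_depth=20))]:
    # TF-IDF features feed each classifier; 3-fold CV accuracy is reported.
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    score = cross_val_score(pipe, data.data, data.target, cv=3).mean()
    print(f"{name}: {score:.3f}")
```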
A Novel Approach to Mathematical Concepts in Data Miningijdmtaiir
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. The paper covers two clustering algorithms, the k-means algorithm and the k-median algorithm. Clustering is illustrated through the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
The document describes an automatic unsupervised data classification method using the Jaya evolutionary algorithm. It proposes using Jaya to optimize multiple cluster validity indices (CVIs) simultaneously to determine the optimal number of clusters and cluster assignments. Twelve real-world datasets from different domains are used to evaluate the method. The results show that the proposed AutoJAYA algorithm is able to accurately detect the number of clusters in each dataset and achieve good performance according to various CVIs, demonstrating its effectiveness at automatic unsupervised data classification.
Top-K Dominating Queries on Incomplete Data with Prioritiesijtsrd
This document discusses algorithms for finding the top-K dominating queries on incomplete datasets. It proposes using a skyline-based algorithm that incorporates priority values for each dimension. This allows the algorithm to determine dominance even when the values are missing for some dimensions. It works by bucketing the data based on bit representations, finding the local skylines within each bucket, and then calculating scores for objects based on their dominance over other objects while considering the priority of dimensions. The top-K objects with the highest scores are then returned as the results. This approach provides more accurate outputs for applications like movie recommendations by allowing users to specify dimension priorities.
At present, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, it is most commonly dynamic and transactional: newly generated data is constantly added to the old/existing data. To discover knowledge from such incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time-consuming. Moreover, to analyze the datasets properly, an efficient classifier model must be constructed, whose objective is to classify unlabeled data into appropriate classes. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set as a dynamic reduct, and an optimization algorithm that uses the reduct to build the corresponding classification system. The method analyzes new data when it becomes available and modifies the reduct accordingly to fit the entire dataset, from which interesting optimal classification rule sets are generated. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to benchmark datasets collected from the UCI repository; the dynamic reduct is computed, and optimal classification rules are generated from it. Experimental results show the efficiency of the proposed method.
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...csandit
There is an exhaustive body of work in the area of engine design covering different methods that try to reduce production costs and optimize engine performance. Mathematical methods based on statistics, self-organizing maps, and neural networks achieve the best results in these designs, but configuring these methods is not easy due to the high number of parameters that have to be measured.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, while partitional clustering produces results in a grid format. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is used.

Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions.

A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to partition documents; it uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.

Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with Ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
With these components in place, we present the Data
Science Machine — an automated system for generating
predictive models from raw data. It starts with a relational
database and automatically generates features to be used
for predictive modeling.
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
ENTROPY-COST RATIO MAXIMIZATION MODEL FOR EFFICIENT STOCK PORTFOLIO SELECTION...cscpconf
This paper introduces a new stock portfolio selection model in a non-stochastic environment. Following the principle of maximum entropy, a new entropy-cost ratio function is introduced as the objective function. The uncertain returns, risks, and dividends of the securities are treated as interval numbers. Along with the objective function, eight different types of constraints are used to make the model pragmatic. Three different models are proposed, describing the future financial market optimistically, pessimistically, and in a combined form, to model the portfolio selection problem. To illustrate the effectiveness and tractability of the proposed models, they are tested on a set of data from the Bombay Stock Exchange (BSE). The solution is obtained by a genetic algorithm.
This document summarizes a research paper that proposes a dynamic approach to improving the k-means clustering algorithm. The proposed approach aims to address two weaknesses of the standard k-means algorithm: its requirement of prior knowledge of the number of clusters k, and its sensitivity to initialization. The approach determines initial cluster centroids by segmenting the data space and selecting high-frequency segments. It then uses the silhouette validity index to dynamically determine the optimal number of clusters k, rather than requiring the user to specify k. The approach is compared to the standard k-means algorithm and other modified approaches, and is shown to improve initial center selection and reduce computation time.
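The paper's centroid-seeding scheme is not reproduced here, but the silhouette-driven choice of k it relies on can be sketched with scikit-learn (toy blob data assumed):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

# Try a range of k and keep the one with the highest mean silhouette width.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("silhouette per k:", {k: round(v, 3) for k, v in scores.items()})
print("chosen k:", best_k)   # should recover 4 on this toy data
```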
A Novel Approach for Clustering Big Data based on MapReduce IJECEIAES
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics, and it helps users understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Various researchers have analyzed different types of clustering algorithms. K-means is the most popular partitioning-based algorithm because it provides good results through accurate calculation on numerical data, but it handles numerical data only, whereas big data is a combination of numerical and categorical data. The K-prototype algorithm deals with numerical as well as categorical data by combining the distances calculated from the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific computation, etc., there is a vast collection of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype implemented on MapReduce gives a better performance gain on multiple nodes than on a single node; CPU execution time and speedup are used as the evaluation metrics. An intelligent splitter is also proposed that splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
Using particle swarm optimization to solve test functions problemsriyaniaes
In this paper, benchmark functions are used to evaluate and verify the particle swarm optimization (PSO) algorithm. The functions used are two-dimensional but are selected with different difficulty levels and different models. To prove the capability of PSO, it is compared with a genetic algorithm (GA); the two algorithms are compared in terms of objective function values and standard deviation. Several runs were carried out to obtain convincing results, the parameters were chosen properly, and the Matlab software was used. The suggested algorithm can solve different engineering problems of different dimensions and outperforms the other in terms of accuracy and speed of convergence.
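A minimal global-best PSO sketch on a standard two-dimensional benchmark (Rastrigin), to make the comparison setup concrete; the inertia and acceleration coefficients are common textbook values, not the paper's settings:

```python
import numpy as np

def pso(objective, dim=2, swarm=30, iters=200, seed=0,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Minimal global-best PSO for a box-constrained minimization problem."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (swarm, dim))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([objective(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # Velocity blends inertia, attraction to personal best and global best.
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

# Two-dimensional Rastrigin, a standard multimodal benchmark (minimum 0 at origin).
rastrigin = lambda p: 10 * len(p) + float(np.sum(p**2 - 10 * np.cos(2 * np.pi * p)))
best, f = pso(rastrigin)
print(best.round(3), round(f, 4))
```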
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Seval Çapraz
This document analyzes a dataset of diabetes records from 130 US hospitals from 1999-2008 using various statistical data analysis and machine learning techniques. It first performs dimensionality reduction using principal component analysis (PCA) and multidimensional scaling (MDS). It then clusters the data using hierarchical clustering and k-means clustering. Cluster validity is assessed using precision. Spectral clustering is also applied and validated using Dunn and Davies-Bouldin indexes, with complete linkage diameter performing best.
The document discusses machine learning algorithms including logistic regression, random forests, support vector machines (SVM), and analysis of variance (ANOVA). It provides descriptions of how each algorithm works, its advantages, and examples of applications. Logistic regression uses a sigmoid function to predict binary outcomes. Random forests create an ensemble of decision trees to make classifications. SVM finds the optimal separating hyperplane between classes. ANOVA splits variability in a data set into systematic and random factors.
The D-basis Algorithm for Association Rules of High ConfidenceITIIIndustries
We develop a new approach for distributed computation of association rules of high confidence on the attributes/columns of a binary table. It is derived from the D-basis algorithm developed by K. Adaricheva and J. B. Nation (Theoretical Computer Science, 2017), which runs multiple times on sub-tables of a given binary table obtained by removing one or more rows. The sets of rules retrieved in these runs are then aggregated. This allows us to obtain a basis of association rules of high confidence, which can be used for ranking all attributes of the table with respect to a given fixed attribute. This paper focuses on algorithmic details and the technical implementation of the new algorithm. Results are given for tests performed on random, synthetic, and real data.
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHcscpconf
Due to the intangible nature of software, accurate and reliable software effort estimation is a challenge in the software industry. Very accurate estimates of software development effort are unlikely because of the inherent uncertainty in software development projects and the complex, dynamic interaction of factors that impact development. Heterogeneity exists in software engineering datasets because the data come from diverse sources; it can be reduced by defining relationships between the data values and classifying them into different clusters. This study focuses on how the combination of clustering and regression techniques can reduce the loss of predictive efficiency caused by the heterogeneity of the data. Using a clustered approach creates subsets of data with a degree of homogeneity that enhances prediction accuracy. It was also observed in this study that ridge regression performs better than the other regression techniques used in the analysis.
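A hedged sketch of the clustered-regression idea: partition the training data with k-means, fit one ridge model per cluster, and route each test point to its nearest cluster's model. The synthetic data and the choice of three clusters are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for a heterogeneous effort dataset.
X, y = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

# Partition the training projects into homogeneous subsets, then fit one ridge
# model per cluster; test projects are routed to their nearest cluster.
km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_tr)
models = {c: Ridge(alpha=1.0).fit(X_tr[km.labels_ == c], y_tr[km.labels_ == c])
          for c in range(3)}
pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                 for c, x in zip(km.predict(X_te), X_te)])
print("MAE (clustered ridge):", round(mean_absolute_error(y_te, pred), 2))
```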
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
In this paper we compare the performance of two classification algorithms. It is useful to differentiate algorithms based on computational performance rather than classification accuracy alone: although classification accuracy between the algorithms is similar, computational performance can differ significantly and can affect the final results. The objective of this paper is therefore to perform a comparative analysis of two machine learning algorithms, K-Nearest Neighbor classification and Logistic Regression. We consider a large dataset of 7981 data points and 112 features and examine the performance of the above-mentioned machine learning algorithms. The processing time and accuracy of the different machine learning techniques are estimated on the collected data set, using 60% for training and the remaining 40% for testing. The paper is organized as follows. Section I contains the introduction and background analysis of the research, and Section II the problem statement. Section III briefly describes our application, the data analysis process, the testing environment, and the methodology of our analysis. Section IV comprises the results of the two algorithms. Finally, the paper concludes with a discussion of future research directions for eliminating the problems in the current research methodology.
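The comparison protocol described above (60/40 split, accuracy plus processing time) can be sketched with scikit-learn; the synthetic dataset merely mimics the quoted scale of 7981 points and 112 features:

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in roughly matching the scale quoted in the paper.
X, y = make_classification(n_samples=7981, n_features=112, n_informative=30,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    t0 = time.perf_counter()
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    print(f"{name}: accuracy={acc:.3f}, time={time.perf_counter() - t0:.2f}s")
```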
A new model for iris data set classification based on linear support vector m...IJECEIAES
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification compared to 95.3% accuracy without parameter optimization.
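A hedged sketch of GA-style tuning of C and gamma on PCA-reduced iris data. The paper's exact GA operators are not reproduced, and an RBF-kernel SVC is used here because gamma only has an effect for non-linear kernels; treat the operators and search ranges below as assumptions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X3 = PCA(n_components=3).fit_transform(X)   # 4 -> 3 features, as in the summary

rng = np.random.default_rng(0)

def fitness(ind):
    C, gamma = 10.0 ** ind           # individuals encode log10(C), log10(gamma)
    clf = SVC(C=C, gamma=gamma)      # RBF kernel, where gamma actually matters
    return cross_val_score(clf, X3, y, cv=5).mean()

# Tiny generational GA: tournament selection, blend crossover, Gaussian mutation.
pop = rng.uniform([-2, -4], [3, 1], size=(20, 2))
for _ in range(15):
    fit = np.array([fitness(ind) for ind in pop])
    parents = pop[[max(rng.choice(20, 3, replace=False), key=lambda i: fit[i])
                   for _ in range(20)]]
    children = (parents + parents[rng.permutation(20)]) / 2.0   # blend crossover
    children += rng.normal(0, 0.2, children.shape)              # mutation
    pop = np.clip(children, [-2, -4], [3, 1])

fit = np.array([fitness(ind) for ind in pop])
best_C, best_gamma = 10.0 ** pop[fit.argmax()]
print(f"best C={best_C:.3g}, gamma={best_gamma:.3g}, CV accuracy={fit.max():.3f}")
```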
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...IJCSES Journal
Predictive analysis belongs to the domain of advanced statistics, where accuracy matters the most. Pairing algorithms with a proper statistical implementation provides better outcomes in terms of accurate prediction on data sets. The prolific use of algorithms leads to simplified mathematical models that require fewer manual calculations. Prediction is the essence of data science and machine learning applications, giving control over situations. Implementing any model requires proper feature extraction, which supports proper model building and thus precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution using feature engineering techniques, that reveal the accuracy of different machine learning models.
Influence over the Dimensionality Reduction and Clustering for Air Quality Me...IJAEMSJORNAL
The current trend in the industry is to analyze large data sets and apply data mining, machine learning techniques to identify a pattern. But the challenges with huge data sets are the high dimensions associated with it. Sometimes in data analytics applications, large amounts of data produce worse performance. Also, most of the data mining algorithms are implemented column wise and too many columns restrict the performance and make it slower. Therefore, dimensionality reduction is an important step in data analysis. Dimensionality reduction is a technique that converts high dimensional data into much lower dimension, such that maximum variance is explained within the first few dimensions. This paper focuses on multivariate statistical and artificial neural networks techniques for data reduction. Each method has a different rationale to preserve the relationship between input parameters during analysis. Principal Component Analysis which is a multivariate technique and Self Organising Map a neural network technique is presented in this paper. Also, a hierarchical clustering approach has been applied to the reduced data set. A case study of Air quality measurement has been considered to evaluate the performance of the proposed techniques.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is made use of for document classification with pattern recognition and machine learning. Advantages of a number
of classification algorithms have been studied in this paper to classify documents. An example of these algorithms is: Naive Bayes'
algorithm, K-Nearest Neighbor, Decision Tree etc. This paper presents a comparative study of advantages and disadvantages of
the above mentioned classification algorithm.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
A Novel Approach to Mathematical Concepts in Data Miningijdmtaiir
-This paper describes three different fundamental
mathematical programming approaches that are relevant to
data mining. They are: Feature Selection, Clustering and
Robust Representation. This paper comprises of two clustering
algorithms such as K-mean algorithm and K-median
algorithms. Clustering is illustrated by the unsupervised
learning of patterns and clusters that may exist in a given
databases and useful tool for Knowledge Discovery in
Database (KDD). The results of k-median algorithm are used
to collecting the blood cancer patient from a medical database.
K-mean clustering is a data mining/machine learning algorithm
used to cluster observations into groups of related observations
without any prior knowledge of those relationships. The kmean algorithm is one of the simplest clustering techniques
and it is commonly used in medical imaging, biometrics and
related fields.
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
The document describes an automatic unsupervised data classification method using the Jaya evolutionary algorithm. It proposes using Jaya to optimize multiple cluster validity indices (CVIs) simultaneously to determine the optimal number of clusters and cluster assignments. Twelve real-world datasets from different domains are used to evaluate the method. The results show that the proposed AutoJAYA algorithm is able to accurately detect the number of clusters in each dataset and achieve good performance according to various CVIs, demonstrating its effectiveness at automatic unsupervised data classification.
Top-K Dominating Queries on Incomplete Data with Prioritiesijtsrd
This document discusses algorithms for finding the top-K dominating queries on incomplete datasets. It proposes using a skyline-based algorithm that incorporates priority values for each dimension. This allows the algorithm to determine dominance even when the values are missing for some dimensions. It works by bucketing the data based on bit representations, finding the local skylines within each bucket, and then calculating scores for objects based on their dominance over other objects while considering the priority of dimensions. The top-K objects with the highest scores are then returned as the results. This approach provides more accurate outputs for applications like movie recommendations by allowing users to specify dimension priorities.
In the present day huge amount of data is generated in every minute and transferred frequently. Although
the data is sometimes static but most commonly it is dynamic and transactional. New data that is being
generated is getting constantly added to the old/existing data. To discover the knowledge from this
incremental data, one approach is to run the algorithm repeatedly for the modified data sets which is time
consuming. Again to analyze the datasets properly, construction of efficient classifier model is necessary.
The objective of developing such a classifier is to classify unlabeled dataset into appropriate classes. The
paper proposes a dimension reduction algorithm that can be applied in dynamic environment for
generation of reduced attribute set as dynamic reduct, and an optimization algorithm which uses the
reduct and build up the corresponding classification system. The method analyzes the new dataset, when it
becomes available, and modifies the reduct accordingly to fit the entire dataset and from the entire data
set, interesting optimal classification rule sets are generated. The concepts of discernibility relation,
attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of
dynamic reduct set, and optimal classification rules are selected using PSO method, which not only
reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed
method has been applied on some benchmark dataset collected from the UCI repository and dynamic
reduct is computed, and from the reduct optimal classification rules are also generated. Experimental
result shows the efficiency of the proposed method.
OPTIMIZATION IN ENGINE DESIGN VIA FORMAL CONCEPT ANALYSIS USING NEGATIVE ATTR...csandit
There is an exhaustive study around the area of engine design that covers different methods that try to reduce costs of production and to optimize the performance of these engines.
Mathematical methods based in statistics, self-organized maps and neural networks reach the best results in these designs but there exists the problem that configuration of these methods is
not an easy work due the high number of parameters that have to be measured.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition the data values with similarity. Similarity
measures are used to estimate transaction relationships. Hierarchical clustering model produces tree
structured results. Partitioned clustering produces results in grid format. Text documents are
unstructured data values with high dimensional attributes. Document clustering group ups unlabeled text
documents into meaningful clusters. Traditional clustering methods require cluster count (K) for the
document grouping process. Clustering accuracy degrades drastically with reference to the unsuitable
cluster count.
Textual data elements are divided into two types’ discriminative words and nondiscriminative
words. Only discriminative words are useful for grouping documents. The involvement of
nondiscriminative words confuses the clustering process and leads to poor clustering solution in return.
A variation inference algorithm is used to infer the document collection structure and partition of
document words at the same time. Dirichlet Process Mixture (DPM) model is used to partition
documents. DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with Ontology support. Semantic weight model is used for the document
similarity analysis. The system improves the scalability with the support of labels and concept relations
for dimensionality reduction process.
With these components in place, we present the Data
Science Machine — an automated system for generating
predictive models from raw data. It starts with a relational
database and automatically generates features to be used
for predictive modeling.
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
ENTROPY-COST RATIO MAXIMIZATION MODEL FOR EFFICIENT STOCK PORTFOLIO SELECTION...cscpconf
This paper introduces a new stock portfolio selection model in non-stochastic environment.Following the principle of maximum entropy, a new entropy-cost ratio function is introduced as
the objective function. The uncertain returns, risks and ividends of the securities are considered as interval numbers. Along with the objective function, eight different types of constraints are used in the model to convert it into a pragmatic one. Three different models have been proposed by defining the future inancial market optimistically, pessimistically and in hecombined form to model the portfolio selection problem. To illustrate the effectiveness and tractability of the proposed models, these are tested on a set of data from Bombay Stock Exchange (BSE). The solution has been done by genetic algorithm.
This document summarizes a research paper that proposes a dynamic approach to improving the k-means clustering algorithm. The proposed approach aims to address two weaknesses of the standard k-means algorithm: its requirement of prior knowledge of the number of clusters k, and its sensitivity to initialization. The approach determines initial cluster centroids by segmenting the data space and selecting high-frequency segments. It then uses the silhouette validity index to dynamically determine the optimal number of clusters k, rather than requiring the user to specify k. The approach is compared to the standard k-means algorithm and other modified approaches, and is shown to improve initial center selection and reduce computation time.
A Novel Approach for Clustering Big Data based on MapReduce IJECEIAES
Clustering is one of the most important applications of data mining. It has attracted attention of researchers in statistics and machine learning. It is used in many applications like information retrieval, image processing and social network analytics etc. It helps the user to understand the similarity and dissimilarity between objects. Cluster analysis makes the users understand complex and large data sets more clearly. There are different types of clustering algorithms analyzed by various researchers. Kmeans is the most popular partitioning based algorithm as it provides good results because of accurate calculation on numerical data. But Kmeans give good results for numerical data only. Big data is combination of numerical and categorical data. Kprototype algorithm is used to deal with numerical as well as categorical data. Kprototype combines the distance calculated from numeric and categorical data. With the growth of data due to social networking websites, business transactions, scientific calculation etc., there is vast collection of structured, semi-structured and unstructured data. So, there is need of optimization of Kprototype so that these varieties of data can be analyzed efficiently.In this work, Kprototype algorithm is implemented on MapReduce in this paper. Experiments have proved that Kprototype implemented on Mapreduce gives better performance gain on multiple nodes as compared to single node. CPU execution time and speedup are used as evaluation metrics for comparison.Intellegent splitter is proposed in this paper which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms proves that proposed algorithm works better for large scale of data.
A Mathematical Programming Approach for Selection of Variables in Cluster Ana...IJRES Journal
The document presents a mathematical programming approach for selecting important variables in cluster analysis. It formulates a nonlinear binary model to minimize the distance between observations within clusters, using indicator variables to select important variables. The model is applied to a sample dataset of 30 observations across 5 variables, correctly identifying variables 3, 4 and 5 as most important for clustering the observations into two groups. The results are compared to an existing variable selection heuristic, with the mathematical programming approach achieving a 100% correct classification versus 97% for the other method.
Using particle swarm optimization to solve test functions problemsriyaniaes
In this paper the benchmarking functions are used to evaluate and check the particle swarm optimization (PSO) algorithm. However, the functions utilized have two dimension but they selected with different difficulty and with different models. In order to prove capability of PSO, it is compared with genetic algorithm (GA). Hence, the two algorithms are compared in terms of objective functions and the standard deviation. Different runs have been taken to get convincing results and the parameters are chosen properly where the Matlab software is used. Where the suggested algorithm can solve different engineering problems with different dimension and outperform the others in term of accuracy and speed of convergence.
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Seval Çapraz
This document analyzes a dataset of diabetes records from 130 US hospitals from 1999-2008 using various statistical data analysis and machine learning techniques. It first performs dimensionality reduction using principal component analysis (PCA) and multidimensional scaling (MDS). It then clusters the data using hierarchical clustering and k-means clustering. Cluster validity is assessed using precision. Spectral clustering is also applied and validated using Dunn and Davies-Bouldin indexes, with complete linkage diameter performing best.
The document discusses machine learning algorithms including logistic regression, random forests, support vector machines (SVM), and analysis of variance (ANOVA). It provides descriptions of how each algorithm works, its advantages, and examples of applications. Logistic regression uses a sigmoid function to predict binary outcomes. Random forests create an ensemble of decision trees to make classifications. SVM finds the optimal separating hyperplane between classes. ANOVA splits variability in a data set into systematic and random factors.
The D-basis Algorithm for Association Rules of High ConfidenceITIIIndustries
We develop a new approach for distributed computing of the association rules of high confidence on the attributes/columns of a binary table. It is derived from the D-basis algorithm developed by K.Adaricheva and J.B.Nation (Theoretical Computer Science, 2017), which runs multiple times on sub-tables of a given binary table, obtained by removing one or more rows. The sets of rules retrieved at these runs are then aggregated. This allows us to obtain a basis of association rules of high confidence, which can be used for ranking all attributes of the table with respect to a given fixed attribute. This paper focuses on some algorithmic details and the technical implementation of the new algorithm. Results are given for tests performed on random, synthetic and real data
ESTIMATING PROJECT DEVELOPMENT EFFORT USING CLUSTERED REGRESSION APPROACHcscpconf
Due to the intangible nature of “software”, accurate and reliable software effort estimation is a challenge in the software Industry. It is unlikely to expect very accurate estimates of software
development effort because of the inherent uncertainty in software development projects and the complex and dynamic interaction of factors that impact software development. Heterogeneity exists in the software engineering datasets because data is made available from diverse sources.
This can be reduced by defining certain relationship between the data values by classifying them into different clusters. This study focuses on how the combination of clustering and
regression techniques can reduce the potential problems in effectiveness of predictive efficiency due to heterogeneity of the data. Using a clustered approach creates the subsets of data having a degree of homogeneity that enhances prediction accuracy. It was also observed in this study that ridge regression performs better than other regression techniques used in the analysis.
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
In this paper we compare the performance of two classification algorithms. It is useful to differentiate algorithms based on computational performance rather than classification accuracy alone: although classification accuracy between the algorithms is similar, computational performance can differ significantly and can affect the final results. The objective of this paper is therefore to perform a comparative analysis of two machine learning algorithms, namely K-Nearest Neighbor classification and Logistic Regression. We consider a large dataset of 7981 data points and 112 features and examine the performance of the above-mentioned machine learning algorithms. The processing time and accuracy of the different machine learning techniques are estimated on the collected data set, using 60% of it for training and the remaining 40% for testing. The paper is organized as follows. Section I includes the introduction and background analysis of the research, and Section II the problem statement. Section III briefly describes our application, the data analysis process, the testing environment, and the methodology of our analysis. Section IV comprises the results of the two algorithms. Finally, the paper concludes with a discussion of future directions for research that address the problems existing in the current research methodology.
A new model for iris data set classification based on linear support vector m...IJECEIAES
1. The authors propose a new model for classifying the iris data set using a linear support vector machine (SVM) classifier with genetic algorithm optimization of the SVM's C and gamma parameters.
2. Principal component analysis was used to reduce the iris data set features from four to three before classification.
3. The genetic algorithm was shown to optimize the SVM parameters, achieving 98.7% accuracy on the iris data set classification compared to 95.3% accuracy without parameter optimization.
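As a rough illustration of the pipeline described above (PCA down to three components, then an SVM whose C and gamma are tuned), here is a sketch that substitutes a plain grid search for the genetic algorithm; the reported accuracies come from the cited paper, not from this snippet.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA reduces the four iris features to three before classification
pipe = make_pipeline(PCA(n_components=3), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                           "svc__gamma": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```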
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI...IJCSES Journal
Prediction is at the heart of modern statistics, where accuracy matters most. Pairing algorithms with a suitable statistical implementation provides better outcomes in terms of accurate prediction on data sets. The prolific use of such algorithms leads to simpler mathematical models that require fewer manual calculations. Prediction is the essence of data science and machine learning applications, giving control over situations. Implementing any model requires proper feature extraction, which helps in proper model building and leads to precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution using feature engineering techniques, which reveal the accuracy of different machine learning models.
This article was published in the Software Developer's Journal's February edition.
It describes the use of the MapReduce paradigm to design clustering algorithms and explains three such algorithms (a toy sketch of the idea follows the list below):
- K-Means Clustering
- Canopy Clustering
- MinHash Clustering
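To give a flavor of the MapReduce formulation, here is a toy, single-process sketch of one k-means iteration expressed as map and reduce steps; this is our own example, not code from the article.

```python
import numpy as np
from collections import defaultdict

def map_step(point, centroids):
    """Map: emit (nearest-centroid id, (point, 1))."""
    cid = int(np.argmin([np.linalg.norm(point - c) for c in centroids]))
    return cid, (point, 1)

def reduce_step(values):
    """Reduce: average all points assigned to one centroid."""
    total = sum(p for p, _ in values)
    count = sum(n for _, n in values)
    return total / count

points = np.random.rand(200, 2)
centroids = points[np.random.choice(len(points), 3, replace=False)]

# One k-means iteration written as explicit map and reduce phases
groups = defaultdict(list)
for p in points:
    cid, value = map_step(p, centroids)
    groups[cid].append(value)
centroids = np.array([reduce_step(v) for cid, v in sorted(groups.items())])
```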
Oversampling technique in student performance classification from engineering...IJECEIAES
This document discusses various oversampling techniques for dealing with imbalanced data in student performance classification. It compares SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN oversampling combined with MLP, gradient boosting, AdaBoost, and random forest classifiers. The results show that Borderline-SMOTE gave the best performance for predicting the minority (low performance) class according to several evaluation metrics. SVMSMOTE also performed well overall, particularly for recall, F1-measure, and AUC. Gradient boosting provided high and consistent precision, recall, F1-measure, and AUC across the different oversampling methods.
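A compact sketch of the kind of oversampling-plus-classifier pipeline compared in that study, assuming the imbalanced-learn package; synthetic data stands in for the student dataset.

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data: roughly 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for sampler in (SMOTE(random_state=0), BorderlineSMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X_train, y_train)   # oversample only the training split
    clf = GradientBoostingClassifier().fit(X_res, y_res)
    print(type(sampler).__name__, f1_score(y_test, clf.predict(X_test)))
```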
An Heterogeneous Population-Based Genetic Algorithm for Data Clusteringijeei-iaes
As a primary data mining method for knowledge discovery, clustering is a technique for classifying a dataset into groups of similar objects. The most popular method for data clustering, K-means, suffers from the drawbacks of requiring the number of clusters and their initial centers, which must be provided by the user. In the literature, several methods have been proposed, in the form of k-means variants, genetic algorithms, or combinations of them, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering processes. In this paper we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm operating on varied individual structures and using a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELijcsit
Predicting student performance is a great concern to higher education managements. This prediction helps to identify and to improve students' performance. Several factors may improve this performance. In the present study, we employ data mining processes, particularly classification, to enhance the quality of the higher educational system. Recently, a new direction has been used to improve classification accuracy by combining classifiers. In this paper, we design and evaluate a fast learning algorithm using an AdaBoost ensemble with a simple genetic algorithm, called “Ada-GA”, where the genetic algorithm is demonstrated to successfully improve the accuracy of the combined classifier performance. The Ada-GA algorithm proved to be of considerable usefulness in identifying students at risk early, especially in very large classes. This early prediction allows the instructor to provide appropriate advising to those students. The Ada-GA algorithm was implemented and tested on the ASSISTments dataset; the results showed that this algorithm successfully improved the detection accuracy as well as reducing the computational complexity.
Fuzzy clustering has been widely studied and applied in a variety of key areas of science and engineering. In this paper the Improved Teaching Learning Based Optimization (ITLBO) algorithm is used for data clustering, in which the objects in the same cluster are similar. The algorithm has been tested on several datasets and compared with other popular clustering algorithms. The results show that the proposed method improves the clustering output and can be used efficiently for fuzzy clustering.
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
This document summarizes a research paper that proposes a new Particle Swarm Optimization (PSO) based K-Prototype clustering algorithm to cluster mixed numeric and categorical data. It begins with background information on clustering algorithms like K-Means, K-Modes, and K-Prototype. It then describes the K-Prototype algorithm, PSO, and discrete binary PSO. Related work integrating PSO with other clustering algorithms is also reviewed. The proposed approach uses binary PSO to select improved initial prototypes for K-Prototype clustering in order to obtain better clustering results than traditional K-Prototype and avoid local optima.
This document discusses using particle swarm optimization to improve the k-prototype clustering algorithm. The k-prototype algorithm clusters data with both numeric and categorical attributes but can get stuck in local optima. The proposed method uses particle swarm optimization, a global optimization technique, to guide the k-prototype algorithm towards better clusterings. Particle swarm optimization models potential solutions as particles that explore the search space. It is integrated with k-prototype clustering to avoid locally optimal solutions and produce better clusterings. The method is tested on standard benchmark datasets and shown to outperform traditional k-modes and k-prototype clustering algorithms.
IRJET- Expert Independent Bayesian Data Fusion and Decision Making Model for ...IRJET Journal
This document proposes an expert independent Bayesian data fusion and decision making model for multi-sensor systems smart control. The model uses a Naive Bayes classifier to predict the system state based only on prior and current sensor data. Simulations of a three sensor system (soil temperature, air temperature, and moisture) achieved an overall prediction accuracy of more than 96%. However, real-world implementation of the proposed algorithm is still needed.
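As a hedged illustration of the fusion idea (three sensor readings feeding a Naive Bayes state predictor), one could sketch it with scikit-learn as below; the readings and state labels are entirely hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical readings: [soil temperature, air temperature, moisture]
X = np.array([[18.0, 22.0, 0.35], [30.0, 35.0, 0.10], [20.0, 24.0, 0.35],
              [31.0, 36.0, 0.12], [19.0, 23.0, 0.40], [32.0, 34.0, 0.08]])
y = np.array([0, 1, 0, 1, 0, 1])      # 0 = normal state, 1 = action needed (hypothetical)

model = GaussianNB().fit(X, y)
print(model.predict([[29.0, 33.0, 0.15]]))   # predicted system state for a new reading
```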
The document discusses clustering documents using a multi-viewpoint similarity measure. It begins with an introduction to document clustering and common similarity measures like cosine similarity. It then proposes a new multi-viewpoint similarity measure that calculates similarity between documents based on multiple reference points, rather than just the origin. This allows a more accurate assessment of similarity. The document outlines an optimization algorithm used to cluster documents by maximizing the new similarity measure. It compares the new approach to existing document clustering methods and similarity measures.
Reduct generation for the incremental data using rough set theorycsandit
In today’s changing world, huge amounts of data are generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional. New data that is being generated is constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment for generating a reduced attribute set as a dynamic reduct. The method analyzes the new dataset, when it becomes available, and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied to a few benchmark datasets collected from the UCI repository and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
This paper presents an approach based on applying an aggregated predictor, formed by multiple versions of a multilayer neural network with a back-propagation optimization algorithm, to help the engineer obtain a list of the most appropriate well-test interpretation models for a given set of pressure/production data. The proposed method consists of three stages: (1) data decorrelation through principal component analysis to reduce the covariance between the variables and the dimension of the input layer of the artificial neural network, (2) bootstrap replicates of the learning set, where the data is repeatedly sampled with a random split into training sets that are used as new learning sets, and (3) automatic reservoir model identification through an aggregated predictor formed by a plurality vote when predicting a new class. This method is described in detail to ensure successful replication of results. The required training and test datasets were generated by using analytical solution models. In our case, 600 samples were used: 300 for training, 100 for cross-validation, and 200 for testing. Different network structures were tested during this study to arrive at an optimum network design. We notice that the single-net methodology always brings about confusion in selecting the correct model, even though the training results for the constructed networks are close to 1. We also notice that principal component analysis is an effective strategy for reducing the number of input features, simplifying the network structure, and lowering the training time of the ANN. The results obtained show that the proposed model provides better performance when predicting new data, with a coefficient of correlation of approximately 95% compared to 80% for a previous approach. The combination of PCA and ANN is more stable and determines more accurate results with less computational complexity than was feasible previously. Clearly, the aggregated predictor is more stable and shows fewer bad classes compared to the previous approach.
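In scikit-learn terms, the aggregated-predictor idea (PCA decorrelation followed by a vote over bootstrap-trained neural networks) could be sketched roughly as follows; the 600-sample well-test data are not reproduced, so a synthetic stand-in is used and all parameter values are our own choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=30, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=0)

# Bootstrap replicates of the learning set, aggregated by plurality vote
model = make_pipeline(
    PCA(n_components=10),
    BaggingClassifier(MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000),
                      n_estimators=10, random_state=0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```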
Data mining, or knowledge discovery in databases (KDD), is the extraction of unknown (hidden) and useful knowledge from data. Data mining is widely used in many areas like retail, sales, e-commerce, remote sensing, bioinformatics, etc. Student performance has become one of the most complex puzzles for universities and colleges in the recent past, given their tremendous growth. In this paper, the authors deployed data mining techniques like classification, association rules, chi-square, etc. for knowledge discovery. For this study, the authors used a data set containing the results of approximately 180 MCA (postgraduate) students from 3 colleges. The study found that one can apply data mining functionalities like chi-square, association rules and lift in education and discover areas of improvement.
Clustering technology has been applied in numerous applications. It can enhance the performance
of information retrieval systems, it can also group Internet users to help improve the click-through rate of
on-line advertising, etc. Over the past few decades, a great many data clustering algorithms have been
developed, including K-Means, DBSCAN, Bi-Clustering and Spectral clustering, etc. In recent years, two
new data clustering algorithms have been proposed, which are affinity propagation (AP, 2007) and density
peak based clustering (DP, 2014). In this work, we empirically compare the performance of these two latest
data clustering algorithms with state-of-the-art, using 6 external and 2 internal clustering validation metrics.
Our experimental results on 16 public datasets show that, the two latest clustering algorithms, AP and DP,
do not always outperform DBSCAN. Therefore, to find the best clustering algorithm for a specific dataset, all
of AP, DP and DBSCAN should be considered. Moreover, we find that the comparison of different clustering
algorithms is closely related to the clustering evaluation metrics adopted. For instance, when using the
Silhouette clustering validation metric, the overall performance of K-Means is as good as AP and DP. This
work has important reference values for researchers and engineers who need to select appropriate clustering
algorithms for their specific applications.
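The comparison workflow described here can be reproduced in outline with scikit-learn; a minimal sketch using one internal metric (Silhouette) on a toy dataset follows, with all parameter values chosen only for illustration.

```python
from sklearn.cluster import AffinityPropagation, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for name, algo in [("AP", AffinityPropagation(random_state=0)),
                   ("DBSCAN", DBSCAN(eps=0.8, min_samples=5)),
                   ("K-Means", KMeans(n_clusters=4, n_init=10, random_state=0))]:
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:                      # metric needs at least two clusters
        print(name, silhouette_score(X, labels))
```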
This document summarizes a study that uses a genetic algorithm to optimize imputing missing cost data for fans used in road tunnels in Sweden. The genetic algorithm is used to impute the missing cost data by optimizing the valid data period used. The results show highly correlated data (R-squared of 0.99) after imputing the missing values, indicating the genetic algorithm provides an effective way to optimize imputing and create complete data that can then be used for forecasting and life cycle cost analysis. The document also reviews other methods for data imputation and discusses experimental results comparing the proposed two-stage approach using K-means clustering and multilayer perceptron on several datasets.
K-Means clustering uses an iterative procedure which is very much sensitive and dependent upon the initial centroids. The initial centroids in the k-means clustering are chosen randomly, and hence the clustering also changes with respect to the initial centroids. This paper tries to overcome this problem of random selection of centroids and hence change of clusters with a premeditated selection of initial centroids. We have used the iris, abalone and wine data sets to demonstrate that the proposed method of finding the initial centroids and using the centroids in k-means algorithm improves the clustering performance. The clustering also remains the same in every run as the initial centroids are not randomly selected but through premeditated method.
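The practical point, that fixing the initial centroids makes k-means deterministic, is easy to see with scikit-learn, which accepts an explicit centroid array as init; the toy selection rule below is only a placeholder for the paper's premeditated method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Placeholder "premeditated" centroids: points spread along the sorted feature sums
order = np.argsort(X.sum(axis=1))
init_centroids = X[order[[0, len(X) // 2, -1]]]

km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)
print(km.inertia_)   # identical on every run, since the initialization is fixed
```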
accuracy rate of 99.50%.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Introduction- e - waste – definition - sources of e-waste– hazardous substances in e-waste - effects of e-waste on environment and human health- need for e-waste management– e-waste handling rules - waste minimization techniques for managing e-waste – recycling of e-waste - disposal treatment methods of e- waste – mechanism of extraction of precious metal from leaching solution-global Scenario of E-waste – E-waste in India- case studies.
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
A review on techniques and modelling methodologies used for checking electrom...nooriasukmaningtyas
The proper function of the integrated circuit (IC) in an inhibiting electromagnetic environment has always been a serious concern throughout the decades of revolution in the world of electronics, from disjunct devices to today’s integrated circuit technology, where billions of transistors are combined on a single chip. The automotive industry and smart vehicles in particular, are confronting design issues such as being prone to electromagnetic interference (EMI). Electronic control devices calculate incorrect outputs because of EMI and sensors give misleading values which can prove fatal in case of automotives. In this paper, the authors have non exhaustively tried to review research work concerned with the investigation of EMI in ICs and prediction of this EMI using various modelling methodologies and measurement setups.
Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapte...University of Maribor
Slides from talk presenting:
Aleš Zamuda: Presentation of IEEE Slovenia CIS (Computational Intelligence Society) Chapter and Networking.
Presentation at IcETRAN 2024 session:
"Inter-Society Networking Panel GRSS/MTT-S/CIS
Panel Session: Promoting Connection and Cooperation"
IEEE Slovenia GRSS
IEEE Serbia and Montenegro MTT-S
IEEE Slovenia CIS
11TH INTERNATIONAL CONFERENCE ON ELECTRICAL, ELECTRONIC AND COMPUTING ENGINEERING
3-6 June 2024, Niš, Serbia
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
A Preference Model on Adaptive Affinity Propagation
International Journal of Electrical and Computer Engineering (IJECE)
Vol. 8, No. 3, June 2018, pp. 1805 – 1813
ISSN: 2088-8708, DOI: 10.11591/ijece.v8i3.pp1805-1813
Institute of Advanced Engineering and Science
Journal homepage: http://iaesjournal.com/online/index.php/IJECE

A Preference Model on Adaptive Affinity Propagation

Rina Refianti1, Achmad Benny Mutiara2, Asep Juarna3, and Adang Suhendra4
1,2,3 Faculty of Computer Science and Information Technology, Gunadarma University, Indonesia
4 Department of Informatics, Gunadarma University, Indonesia
Article Info
Article history:
Received Nov 17, 2017
Revised Feb 5, 2018
Accepted Feb 19, 2018
Keyword:
Affinity Propagation
Exemplars
Data Points
Similarity Matrix
Preference
ABSTRACT
In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a new data clustering technique that uses iterative message passing and considers all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter "preference" p. Although the original AP algorithm has shown much success in data clustering, it still suffers from one limitation: it is not easy to determine the value of the parameter "preference" p that yields an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter "preference" p, i.e. it is modeled based on the similarity distribution. Having the SM and p, the Modified Adaptive AP (MAAP) procedure is run. The MAAP procedure means that we omit the adaptive p-scanning algorithm of the original Adaptive AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than the original AP for the random non-partition dataset, and (ii) for the random 4-partition dataset and real datasets the proposed algorithm succeeds in identifying clusters according to the number of the dataset's true labels, with execution times comparable to those of the original AP. Besides that, the MAAP-DDP algorithm proves more feasible and effective than the original AAP procedure.
Copyright © 2018 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Achmad Benny Mutiara
Faculty of Computer Science and Information Technology, Gunadarma University
Jl. Margonda Raya No.100, Depok 16424, Indonesia
+62-815-916055
amutiara@staff.gunadarma.ac.id
1. INTRODUCTION
In the big data era, the analysis of large amounts of data has become an essential area in Computer Science. The methods of data mining, among others clustering methods, classification methods, etc., are needed to extract or mine knowledge from large amounts of data. Grouping data in accordance with their similarities across multiple characteristics is known as clustering [1].
In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP), proposed by Brendan J. Frey and Delbert Dueck (2007) [2]. Unlike previous clustering methods such as k-means, which take random data points as the first potential exemplars, AP considers all the data points as potential cluster centers [3, 4]. AP works by taking as input the similarities between data points and simultaneously considering all the data points as potential cluster centers, called exemplars, by iteratively calculating the responsibility r and the availability a based on the similarity until convergence. After the points converge, AP finds clusters with far less error than k-means, and it does so in less than one-hundredth of the time [3].
AP has several advantages over other clustering methods owing to its consideration of all data points as potential exemplars, while most clustering methods find exemplars by recording and tracing a fixed set of data points and iteratively correcting it [3]. Because of this, most clustering methods do not change the set much and just keep tracking the particular sets. Furthermore, AP supports non-symmetrical similarities and does not depend on the initialization found in other clustering algorithms. Because of these advantages, it has been successfully applied in many disciplines such as image clustering [3] and Chinese calligraphy clustering [5].
This paper is a continuation of our previous works [6, 7]. In [6] we surveyed and investigated the performance of various AP approaches, i.e. Adaptive AP, Partition AP, Soft Constraint AP, and Fuzzy Statistic AP. It was found that i) Partition AP (PAP) is the fastest among the four approaches, ii) Adaptive AP (AAP) can remove the oscillations and is more stable than the others, and iii) Fuzzy Statistic AP (FSAP) can produce a smaller cluster number than the other approaches, because its preferences are generated iteratively using fuzzy-statistical methods. In [7] we investigated the time complexity of various AP approaches, i.e. AAP, PAP, Landmark AP (L-AP), and K-AP. It was found that the approach with the most efficient computational cost and the fastest running time is Landmark AP, although its clustering result differs considerably from AP in the number of clusters.
Although AP itself has been proven to be faster than k-means and has shown much success in data clustering, it still has a limitation: it is not easy to determine the value of the parameter "preference" p that yields an optimal clustering solution [8]. The goal of this paper is to resolve this limitation by proposing a new model of the parameter "preference" p, i.e. it is modeled based on the distribution of similarity. The model is explained in Subsection 3.1. It is then applied to Adaptive AP (AAP), because AAP has a better level of accuracy than the other approaches.
In the experiments, random non-partition datasets, random partition datasets, and real datasets from the UCI repository [9] are used. To partition a dataset, the k-means algorithm [10] is applied to generate four groups of data points. The results of our experiments are shown in Section 4.
2. THEORETICAL BACKGROUND
2.1. Affinity Propagation
2.1.1. Input Preference
Suppose we have a set of data points X = {x1, x2, x3, ..., xn}. AP takes as input a similarity matrix (SM) between data points, s, where each similarity s(i, j) shows how well data point xj is suited to be an exemplar for xi. Similarities of any type can be accepted, e.g. the negative Euclidean distance for real-valued data and the Jaccard coefficient for non-metric data, so AP can be widely applied in different areas [7].
Instead of requiring that the number of clusters be predetermined, AP takes as input a real number s(j, j) for each data point j, so that data points with larger values of s(j, j) are more likely to be chosen as exemplars. These values are referred to as "preferences". The preference values affect the number of clusters produced and can be set to the median of the similarities or to their minimum:
p = median(s(:)), or p = min(s(:))    (1)
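As a concrete illustration, the following sketch (our example, assuming Python with NumPy; variable names are not from the paper) builds a similarity matrix from the negative squared Euclidean distance, one common choice, and sets the preference p to the median or minimum of the similarities as in Eq. (1).

```python
import numpy as np

def similarity_matrix(X):
    """Negative squared Euclidean distance between all pairs of points."""
    # ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 xi.xj
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return -d2

X = np.random.rand(100, 2)            # toy data set
S = similarity_matrix(X)

# Preference p as in Eq. (1): median (more clusters) or minimum (fewer clusters)
off_diag = S[~np.eye(len(X), dtype=bool)]
p_median = np.median(off_diag)
p_min = off_diag.min()

np.fill_diagonal(S, p_median)         # s(j, j) = p
```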
2.1.2. Message passing
Given the similarities s(i, j), (i, j = 1, 2, ..., n), AP attempts to find the exemplars that maximize the net similarity, i.e. the overall sum of similarities between all exemplars and their member data points. The process in AP can be viewed as passing values between data points. Two values are passed between data points: responsibility and availability. The responsibility r(i, j) shows how well-suited point j is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability a(i, j) reflects the accumulated evidence for how appropriate it would be for point i to choose point j as its exemplar, taking into account the support from other points that point j should be an exemplar.
Figure 1 shows how the availability and responsibility work in AP. Availabilities a(i, j) are transmitted from candidate exemplars to data points and state the availability of candidate exemplars to data points as cluster centers. Responsibilities r(i, j) are transmitted from data points to candidate exemplars and state how strongly each data point favors the candidate exemplar over other candidate exemplars. This message passing continues until convergence is met or the iterations reach a certain number.
Initially all r(i, j) and a(i, j) are set to 0, and their values are then updated iteratively as follows until
convergence is achieved:
r(i, j) =
\begin{cases}
s(i, j) - \max_{k \neq j}\{a(i, k) + s(i, k)\}, & i \neq j \\
s(i, j) - \max_{k \neq j}\{s(i, k)\}, & i = j
\end{cases}
\qquad (2)

a(i, j) =
\begin{cases}
\min\{0,\; r(j, j) + \sum_{k \neq i, j} \max\{0, r(k, j)\}\}, & i \neq j \\
\sum_{k \neq j} \max\{0, r(k, j)\}, & i = j
\end{cases}
\qquad (3)
Figure 1. Message Passing in Affinity Propagation [3]
After the responsibilities and availabilities are calculated, their values are damped at each iteration, until
convergence, as follows:
r(i, j) = (1 − λ)r(i, j) + λro(i, j) (4)
a(i, j) = (1 − λ)a(i, j) + λao(i, j) (5)
where λ is a damping factor introduced to reduce numerical oscillations, and ro(i, j) and ao(i, j) are the previous values
of the responsibility and availability, respectively. λ should be greater than or equal to 0.5 and less than 1, i.e.
0.5 ≤ λ < 1. A high value of λ may help avoid numerical oscillations, although this is not guaranteed, while it
also makes AP converge more slowly [11].
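To make Eqs. (2)-(5) concrete, the following Python sketch performs one damped update sweep. It is a direct, unvectorised transcription of the equations as written above, not the authors' MATLAB implementation, and it favours readability over speed:

```python
import numpy as np

def update_messages(S, R, A, lam=0.5):
    """One damped sweep of the responsibility/availability updates (Eqs. 2-5)."""
    n = S.shape[0]
    AS = A + S
    R_new = np.empty_like(R)
    for i in range(n):
        for j in range(n):
            mask = np.ones(n, dtype=bool)
            mask[j] = False
            if i == j:
                R_new[i, j] = S[i, j] - np.max(S[i, mask])    # Eq. 2, i = j
            else:
                R_new[i, j] = S[i, j] - np.max(AS[i, mask])   # Eq. 2, i != j
    A_new = np.empty_like(A)
    Rp = np.maximum(R_new, 0.0)                               # max{0, r(k, j)}
    for i in range(n):
        for j in range(n):
            mask = np.ones(n, dtype=bool)
            mask[[i, j]] = False
            if i == j:
                A_new[j, j] = np.sum(Rp[mask, j])             # Eq. 3, i = j
            else:
                A_new[i, j] = min(0.0, R_new[j, j] + np.sum(Rp[mask, j]))  # Eq. 3, i != j
    # Eqs. 4-5: damp the new messages with the previous ones
    return (1 - lam) * R_new + lam * R, (1 - lam) * A_new + lam * A
```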
2.1.3. Exemplar decision
For a data point xi, if xj maximizes r(i, j) + a(i, j), then either (i) xj is the most suitable exemplar
for xi, or (ii) xi itself is an exemplar (when j = i). The exemplar for xi is selected according to the following formula:

c_i ← \arg\max_{j}\{r(i, j) + a(i, j)\} \qquad (6)

where ci is the exemplar for xi.
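The exemplar decision of Eq. (6) can then be read off from the final messages; the small helper below reuses the R and A produced by the sketch above, and the treatment of the exemplar set is an illustrative simplification:

```python
import numpy as np

def assign_exemplars(R, A):
    """Eq. 6: each point i chooses the j that maximises r(i, j) + a(i, j)."""
    c = np.argmax(R + A, axis=1)      # c[i] = exemplar chosen by point i
    exemplars = np.unique(c)          # the distinct exemplars define the clusters
    return c, exemplars
```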
2.2. Adaptive Affinity Propagation
AP has many extensions. One of them is Adaptive Affinity Propagation (AAP). AAP is designed
to address two limitations of AP: we cannot know in advance which preference value yields the best clustering solution, and if
oscillations occur they cannot be eliminated automatically. To solve these problems, AAP adapts to the needs of the
data set by adjusting the values of the preferences and the damping factor.
As in [11], let C(i) denote the number of clusters at iteration i. We summarize the AAP
algorithm as follows:
Algorithm 1 Adaptive Affinity Propagation (AAP)
Input : Data Points xi, i = 1, 2, ..., n
Output : Centers of Clusters C(i)
1: Set parameters λ = 0.5;
2: Calculate s(i, j);
3: Set r(i, j) = 0 and a(i, j) = 0;
4: Run AP steps (Eq.2 - Eq.6);
5: Verify whether oscillations occur. If oscillations occur, then λ ← λ + λstep; otherwise continue running the
AP steps.
6: If C(i) ≤ C(i + 1), then p ← p + pstep, and s(i, i) ← p. Go to step 4.
As proposed in [11], the AAP algorithm stops when C(i) drops below 2.
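A heavily simplified Python skeleton of Algorithm 1 is given below. It reuses update_messages and assign_exemplars from the sketches in Section 2.1; the step sizes, the convergence window, the initial preference and the scanning direction (p_step < 0, since the similarities are negative) are our assumptions, and the best-solution bookkeeping of [11] is omitted:

```python
import numpy as np

def run_ap(S, lam, max_iter=500, conv_window=30):
    """Iterate Eqs. 2-6 until the exemplar choices stop changing; also report
    whether the messages were still oscillating when the iteration limit hit."""
    n = S.shape[0]
    R, A = np.zeros((n, n)), np.zeros((n, n))
    prev, stable = None, 0
    for _ in range(max_iter):
        R, A = update_messages(S, R, A, lam)
        c, exemplars = assign_exemplars(R, A)
        stable = stable + 1 if prev is not None and np.array_equal(c, prev) else 0
        prev = c
        if stable >= conv_window:
            return c, len(exemplars), False        # converged
    return c, len(exemplars), True                 # never settled: treated as oscillation

def adaptive_ap(S, lam=0.5, lam_step=0.05, p_step=-0.5):
    """Collect one clustering per scanned preference value (steps 4-6)."""
    S = S.copy()
    p = np.median(S)                               # starting preference (assumed)
    solutions = []
    while True:
        np.fill_diagonal(S, p)                     # s(i, i) <- p
        c, n_clusters, oscillating = run_ap(S, lam)
        if oscillating and lam < 0.95:
            lam += lam_step                        # step 5: raise the damping and retry
            continue
        if n_clusters < 2:                         # stopping rule from [11]
            return solutions
        solutions.append((p, n_clusters, c))
        p += p_step                                # step 6: p <- p + p_step
```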
AAP has been shown to produce clustering results of equal or better quality than AP and to find
optimal solutions on different kinds of data sets [11]. AAP has been shown to be able to process several types of data
such as gene expression [11], travel routes [11], image clustering [11, 12], mixed numeric and categorical datasets
[13], text documents [14], and zoogeographical regions [15].
3. PROPOSED WORK
3.1. Proposed Preference Model
Given a set of data points X = {x1, x2, ..., xn}, suppose we randomly pick two data points xi and xj. If
the distance from xi to the other points is larger than that from xj, then xi has a lower possibility than xj of being the dataset center.
On the basis of this assumption, a preference for each data point can be computed as follows.
For a given data point xi, similarities from xi to other data points are summed up as:
TDS(i) = \sum_{j=1, j \neq i}^{n} s(x_i, x_j) \qquad (7)
Then it is normalized as:
NTDS(i) = \frac{TDS(i)}{\sum_{j=1}^{n} TDS(j)} \qquad (8)
Finally, the preference of each data point can be defined as follows:
p(i) = s(i, i) = N × NTDS(i) × Const. (9)
where Const can be any non-zero real number or min(s(:)).
Equation (9) shows that the preferences represent and reflect the distribution of the data set, and we expect that this
will lead to better results, as shown in the results section. This model is then applied to the Modified Adaptive Affinity
Propagation (MAAP) algorithm, as explained in subsection 3.2.. Our model is simple and easy to apply compared
to another model proposed by Ping Li et al. (2017) [16].
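A small sketch of Eqs. (7)-(9) in Python is shown below; the function name and the default value of Const are illustrative, not taken from the paper:

```python
import numpy as np

def distribution_based_preference(S, const=1.0):
    """p(i) = n * NTDS(i) * Const for a similarity matrix S (Eqs. 7-9)."""
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)
    tds = np.where(off, S, 0.0).sum(axis=1)     # TDS(i), Eq. 7
    ntds = tds / tds.sum()                      # NTDS(i), Eq. 8
    return n * ntds * const                     # p(i), Eq. 9

# The resulting vector is placed on the diagonal of S before running AP:
# np.fill_diagonal(S, distribution_based_preference(S, const=2.0))
```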
3.2. Modified Adaptive Affinity Propagation (MAAP)
We modify Adaptive AP in the following manner:
Algorithm 2 Modified Adaptive Affinity Propagation (MAAP)
Input : Data Points xi, i = 1, 2, ..., n
Output : Centers of Clusters C(i)
1: Set parameters λ = 0.5;
2: Calculate s(i, j), and set p as Eq.9
3: Set r(i, j) = 0 and a(i, j) = 0;
4: Run AP steps (Eq.2 - Eq.6);
5: Verify whether oscillations occur. If oscillations occur, then λ ← λ + λstep; otherwise continue running the
AP steps.
Because the proposed preferences p are set directly in the algorithm, step 6 of Algorithm 1 can be omitted.
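A matching Python sketch of Algorithm 2 is given below; it reuses run_ap from the Algorithm 1 sketch and distribution_based_preference from subsection 3.1, and the parameter values are illustrative only:

```python
import numpy as np

def maap(S, const=1.0, lam=0.5, lam_step=0.05):
    """Algorithm 2 sketch: p is fixed once by Eq. 9, so only lambda is adapted."""
    S = S.copy()
    np.fill_diagonal(S, distribution_based_preference(S, const))  # step 2, Eq. 9
    while True:
        c, n_clusters, oscillating = run_ap(S, lam)               # steps 3-4
        if not oscillating or lam >= 0.95:
            return c, n_clusters
        lam += lam_step                                           # step 5: raise damping, retry
```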
3.3. Cluster Evaluation
The Silhouette validation index and the Fowlkes-Mallows index [17] are used to evaluate the quality of a clustering
process.
For a given dataset X = {x1, x2, ..., xn}, x_i ∈ \mathbb{R}^m, the Silhouette index of x_i can be defined as

Sil(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))} \qquad (10)

where a(x_i) is the mean distance from x_i to the other data points in its own cluster K_c, and d(x_i, K_{c'}) denotes
the mean distance from x_i to all data points in cluster K_{c'} (c' ≠ c). b(x_i) is then defined as

b(x_i) = \min_{c' = 1, 2, ..., C,\; c' \neq c} d(x_i, K_{c'}) \qquad (11)

where C is the number of clusters.
For the whole dataset X = {x1, x2, ..., xn}, the Silhouette index of X can be expressed as
follows:

Sil(X) = \frac{1}{n} \sum_{i=1}^{n} Sil(x_i) \qquad (12)
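In practice, the dataset-level Silhouette of Eq. (12) can be obtained with scikit-learn's silhouette_score; the toy data and labels below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.random.rand(60, 2)                          # toy data
labels = (X[:, 0] > np.median(X[:, 0])).astype(int)  # a toy two-cluster assignment
print(silhouette_score(X, labels))                 # Sil(X) as in Eq. 12
```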
The Fowlkes-Mallows Index (FMI) is an external criterion [17], defined as

FMI = \sqrt{\frac{a}{a + b} \times \frac{a}{a + c}} \qquad (13)

where a is the number of pairs of data points with the same label that are assigned to the same cluster, b is the number of pairs with
the same label that are assigned to different clusters, and c is the number of pairs with different labels that are assigned to
the same cluster.
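A small sketch of Eq. (13) computed from the pair counts a, b and c is given below, checked against scikit-learn's fowlkes_mallows_score; the label arrays are illustrative:

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import fowlkes_mallows_score

def fmi(labels_true, labels_pred):
    a = b = c = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_label = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_label and same_cluster:
            a += 1          # same label, same cluster
        elif same_label:
            b += 1          # same label, different clusters
        elif same_cluster:
            c += 1          # different labels, same cluster
    return np.sqrt((a / (a + b)) * (a / (a + c)))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(fmi(y_true, y_pred), fowlkes_mallows_score(y_true, y_pred))  # both ~0.577
```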
4. RESULTS AND DISCUSSION
The MAAP algorithm was written and run in MATLAB R2014b. The tests were carried out on a machine with 4 GB RAM
and an Intel(R) Core(TM) i5-2670QM 2.20 GHz processor. We tested the MAAP algorithm with:
• two-dimensional random non-partition data point sets of sizes 100, 500, 1000, 1500 and 2000, to observe how the
algorithms scale;
• two-dimensional random partition data point sets of sizes 100, 300, 500, 800, and 1000. The
random non-partition data points are generated using a uniform distribution from 0 to 1; the random partition
data points are generated in the same way and then divided into four groups using the k-means algorithm (a sketch of this generation appears after the list);
• real datasets as shown in Table 1, which can be downloaded from the UCI Repository [9].
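As referenced in the list above, a sketch of how such random data could be generated is shown below; the exact sizes, random seed and k-means settings used in the experiments are not specified in the paper, so the values here are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_nonpart = rng.uniform(0.0, 1.0, size=(1000, 2))       # random non-partition set

X_part = rng.uniform(0.0, 1.0, size=(1000, 2))          # random 4-partition set
true_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_part)
```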
Table 1. UCI Datasets
Datasets True Cluster Number of Samples Dimension
4k2 far 4 400 2
Ionosphere 2 351 4
Iris 3 150 4
Wine 3 178 13
For the similarity, we use the negative squared Euclidean distance between data points: for points x_i and x_j,
s(i, j) = -\lVert x_i - x_j \rVert^2.
4.1. Experiments on Random Non-Partition Dataset
The clustering results on the random non-partition datasets are presented in Table 2 and Table 3, and Figure 2
shows an example result with N = 1000 data points. From these tables and Figure 2, although the number of
clusters NC produced by the MAAP-DDP algorithm is almost the same as that of the original AP with p =
min(s), the MAAP-DDP algorithm is slower than the original AP both with p = median(s) and with p = min(s). This
makes sense, because the MAAP-DDP algorithm adaptively searches for the λ-value in order to eliminate the oscillations.
For the various values of N, the Silhouette index (Sil) of both algorithms is almost the same, ranging from
0.3 to 0.45. Interestingly, the Sil values from MAAP-DDP are more constant, around 0.325, which means that the
clusters produced by MAAP-DDP are more stable than those of the original AP. The FMI cannot be calculated
because the random non-partition datasets do not have true labels.
4.2. Experiments on Random Partition Dataset
The clustering results on random 4-partition dataset are presented in table 4, table 5. And Figure 3 shows
an example result with number of data N = 1000. From those tables and Figure 3, the MAAP-DDP has succeeded
to identify clusters according to the number of dataset’s true labels. The number of dataset’s true labels is 4, and
the number of clusters (NC) resulting from the MAAP-DDP is also 4 for various values of N.
The speed of the MAAP-DDP is comparable with those of original AP, it means that the execution time
of MAAP-DDP is not slower than those original AP. Although the Sil values of the MAAP-DDP are smaller than
Table 2. AP Clustering Results on Random non-partition Data
AP (p = median(s)) AP (p = min(s))
N NC ET(s) Sil NC ET(s) Sil
100 10 0.021 0.440 3 0.117 0.420
500 19 0.963 0.374 8 0.531 0.384
1000 27 3.232 0.369 10 2.723 0.371
1500 34 11.019 0.361 12 8.233 0.359
2000 37 16.193 0.358 15 11.022 0.367
N : Number of data    NC : Number of Clusters    ET : Execution Time
Sil : Silhouette Index    FMI : Fowlkes-Mallows Index
Table 3. MAAP-DDP Clustering Results on Random non-partition Data
MAAP-DDP
N NC ET(s) Sil Const
100 6 0.048 0.325 1.0
500 8 2.183 0.325 2.0
1000 10 8.734 0.318 2.0
1500 13 18.055 0.325 2.0
2000 15 31.620 0.323 2.0
Figure 2. (a) Non-partition random data, N = 1000; (b) AP with p = min(s) for non-partition random data,
N = 1000; (c) AP with p = median(s) for non-partition random data, N = 1000; (d) MAAP with distributed-data-based
p for non-partition random data, N = 1000.
those of AP, the FMI values of MAAP-DDP are greater than those of the original AP. This means that MAAP-DDP
is better at recovering the true clustering structure. The key parameter of the MAAP-DDP algorithm is Const
in Eq. 9; the algorithm is designed to adaptively find the Const value that yields the best clustering solution.
Figure 3. (a) 4-partition random data, N = 1000; (b) AP with p = min(s) for 4-partition random data, N = 1000;
(c) AP with p = median(s) for 4-partition random data, N = 1000; (d) MAAP with distributed-data-based p for
4-partition random data, N = 1000.
Table 4. AP Clustering Results on Random 4-partition Data
AP (p= median(s)) AP (p= min(s))
N NC ET(s) Sil FMI NC ET(s) Sil FMI
100 11 0.094 0.358 0.58 2 0.090 0.368 0.718
300 18 0.477 0.350 0.38 4 0.502 0.338 0.639
500 27 1.502 0.336 0.35 5 1.444 0.292 0.556
800 35 4.252 0.348 0.31 7 4.872 0.317 0.457
1000 36 6.800 0.343 0.31 9 5.642 0.304 0.474
N : Number of data    NC : Number of Clusters    ET : Execution Time
Sil : Silhouette Index    FMI : Fowlkes-Mallows Index
Table 5. MAAP-DDP Clustering Results on Random 4-partition Data
MAAP-DDP
N NC ET(s) Sil FMI Const
100 4 0.164 0.300 0.421 0.029
300 4 0.578 0.224 0.545 6.457
500 4 0.898 0.278 0.438 0.410
800 4 3.641 0.228 0.531 27.554
1000 4 6.312 0.262 0.474 30.850
4.3. Experiments on Real Datasets
Tables 6 and 7 show that MAAP has successfully identified the clusters of all the real datasets, i.e.
the number of clusters is equal to the number of true clusters (TC), whereas the original AP did not always succeed
in doing so. The original AP with p set to the minimum value of s succeeded in identifying the clusters of three of the real
datasets, but failed for the Ionosphere dataset. The key parameter of the MAAP-DDP algorithm is Const
in Eq. 9; the algorithm is designed to adaptively find the Const value that yields the best clustering solution.
The success of MAAP-DDP is supported by its Sil values, which are all greater than 0.300, and by its
FMI values. Its Sil values are: 0.455 for 4k2 far, 0.513 for Ionosphere, 0.522 for Iris and 0.365 for Wine.
Its FMI values are: 0.681 for 4k2 far, 0.719 for Ionosphere, 0.679 for Iris and 0.622 for Wine.
Table 6. AP Clustering Results on Real Datasets
AP (p = min(s)) AP (p = med(s))
TC NC ET(s) Sil FMI NC ET(s) Sil FMI
4k2 far 4 4 0.790 0.761 1.000 6 0.808 0.450 0.743
Ionosphere 2 4 0.633 0.386 0.575 28 1.517 0.444 0.413
Iris 3 3 0.129 0.541 0.809 6 0.103 0.469 0.724
Wine 3 3 0.165 0.491 0.830 11 0.174 0.314 0.468
TC : True Clusters    NC : Number of Clusters    ET : Execution Time
Sil : Silhouette Index    FMI : Fowlkes-Mallows Index
Table 7. MAAP-DDP Clustering Results on Real Datasets
MAAP-DDP
TC NC ET(s) Sil FMI Const
4k2 far 4 4 1.318 0.455 0.681 4.879
Ionosphere 2 2 1.228 0.513 0.719 160.200
Iris 3 3 0.771 0.522 0.679 9.252
Wine 3 3 0.795 0.365 0.622 7.559
5. CONCLUSIONS AND FUTURE RESEARCH
From the above results it can be concluded that: (i) the proposed algorithm, MAAP-DDP, is slower than the original
AP for the random non-partition datasets; (ii) for the random 4-partition datasets and the real datasets, the proposed algorithm
has succeeded in identifying clusters according to the number of true labels in the dataset, with execution times that are
comparable to those of the original AP.
Besides that, the MAAP-DDP algorithm is more feasible and effective than the original AAP procedure:
the original AAP algorithm stops only when C(i) drops below 2, while the MAAP-DDP algorithm terminates
once the best clusters have been found. The key parameter of the MAAP-DDP algorithm is Const in Eq. 9;
the algorithm is designed to adaptively find the Const value that yields the best clustering solution.
In future work, to further verify the algorithm, we plan to test the MAAP-DDP algorithm on other
kinds of datasets, e.g. synthetic datasets, face-image datasets, and so on.
ACKNOWLEDGMENT
The authors gratefully acknowledge Gunadarma University for providing research funding and for
permission to use the research facilities.
REFERENCES
[1] K. R. Nirmal and K. V. V. Satyanarayana, “Issues of k means clustering while migrating to map reduce
paradigm with big data: A survey,” International Journal of Electrical and Computer Engineering (IJECE),
vol. 6, no. 6, pp. 3047–3051, December 2016.
[2] X. Shi, W. Wang, and C. Zhang, “An empirical comparison of latest data clustering algorithms with state-
of-the-art,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 5, no. 2, pp. 410–415,
February 2017.
[3] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, pp. 972–976, 2007.
[4] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of
the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[5] D. Xia, F. Wu, X. Zhang, and Y. Zhuang, “Local and global approaches of affinity propagation clustering for
large scale data,” CoRR, vol. abs/0910.1650, 2009. [Online]. Available: http://arxiv.org/abs/0910.1650
[6] R. Refianti, A. Mutiara, and A. Syamsudduha, “Performance evaluation of affinity propagation approaches
on data clustering,” International Journal of Advanced Computer Science and Applications(IJACSA), vol. 7,
no. 3, 2016.
[7] R. Refianti, A. Mutiara, and S. Gunawan, “Time complexity comparison between affinity propagation
algorithms,” Journal of Theoretical and Applied Information Technology, vol. 95, no. 7, pp. 1497–1505, April
2017.
[8] V. K. Ramana, “Enhanced fuzzy based affinity propagation for multispectral images,” Imperial Journal of
Interdisciplinary Research (IJIR), vol. 12, no. 12, pp. 1548–1551, 2016.
[9] A. Asuncion and D. Newman, “UCI machine learning repository,” University of California, Irvine,
School of Information and Computer Sciences, 2007. [Online]. Available:
http://www.ics.uci.edu/∼mlearn/MLRepository.html
[10] J. Kogan, Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press, 2007.
[11] K. Wang, J. Zhang, D. Li, X. Zhang, and T. Guo, “Adaptive affinity propagation clustering,” CoRR, vol.
abs/0805.1096, 2008. [Online]. Available: http://arxiv.org/abs/0805.1096
[12] R. Refianti, A. Mutiara, and A. Syamsudduha, “Performance evaluation of affinity propagation approaches
on data clustering,” International Journal of Advanced Computer Science and Applications(IJACSA), vol. 7,
no. 3, 2016.
[13] K. Zhang and X. Gu, “An affinity propagation clustering algorithm for mixed numeric and categorical
datasets,” Mathematical Problems in Engineering, September 2014.
[14] Y. He, Q. Chen, X. Wang, R. Xu, X. Bai, and X. Meng, “An adaptive affinity propagation document
clustering,” in Informatics and Systems (INFOS), 2010 The 7th International Conference on, March 2010, pp.
1–7.
[15] M. Rueda, M. A. Rodriguez, and B. A. Hawkins, “Identifying global zoogeographical regions: Lessons from
wallace,” Journal of Biogeography, vol. 40, no. 12, pp. 2215–2225, Dec. 2013.
[16] P. Li, H. Ji, B. Wang, Z. Huang, and H. Li, “Adjustable preference affinity propagation clustering,” Pattern
Recognition Letters, vol. 85, pp. 72–78, January 2017.
[17] P. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of
Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.