The document describes a research article that proposes a parallel glowworm swarm optimization algorithm for clustering large unstructured data sets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids, and it employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed that the algorithm scales well with increasing data set sizes and achieves near-linear speedup while maintaining high clustering quality, demonstrating that it is more efficient than traditional algorithms for clustering large unstructured data.
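The glowworm mechanics the article builds on can be sketched as follows. This is a minimal serial sketch of the generic glowworm swarm optimization update rules (luciferin decay plus fitness-proportional gain, then movement toward a brighter neighbor within a sensing radius), not the paper's MapReduce-parallel clustering variant; the objective function, constants, and all names here are illustrative assumptions.

```python
import math
import random

# Generic glowworm swarm optimization sketch (serial, 2-D).
# RHO: luciferin decay, GAMMA: luciferin gain, STEP: move size, RADIUS: sensing range.
RHO, GAMMA, STEP, RADIUS = 0.4, 0.6, 0.03, 1.0

def fitness(p):
    # Illustrative multimodal objective: peaks near (1, 1) and (-1, -1).
    return -min((p[0] - 1) ** 2 + (p[1] - 1) ** 2,
                (p[0] + 1) ** 2 + (p[1] + 1) ** 2)

def gso(n=50, iters=100, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-3, 3), rng.uniform(-3, 3)] for _ in range(n)]
    luc = [5.0] * n
    for _ in range(iters):
        # Luciferin update: decay plus fitness-proportional gain.
        luc = [(1 - RHO) * l + GAMMA * fitness(p) for l, p in zip(luc, pos)]
        for i in range(n):
            # Neighbors: strictly brighter glowworms within the sensing radius.
            nbrs = [j for j in range(n) if j != i
                    and luc[j] > luc[i]
                    and math.dist(pos[i], pos[j]) < RADIUS]
            if not nbrs:
                continue
            b = max(nbrs, key=lambda j: luc[j])  # move toward the brightest
            d = math.dist(pos[i], pos[b])
            if d > 0:
                pos[i] = [pos[i][k] + STEP * (pos[b][k] - pos[i][k]) / d
                          for k in range(2)]
    return pos

print(gso()[:3])
```

Because glowworms only move toward brighter neighbors, the swarm concentrates around multiple local optima simultaneously, which is what makes the method attractive for finding several cluster centroids at once.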
TUPLE VALUE BASED MULTIPLICATIVE DATA PERTURBATION APPROACH TO PRESERVE PRIVACY (IJDKP)
Huge volumes of data from domain-specific applications, such as medical, financial, library, telephone, shopping, and individual records, are regularly generated. Sharing these data has proved beneficial for data mining applications. On one hand, such data is an important asset for business decision making when analyzed; on the other hand, data privacy concerns may prevent data owners from sharing information for data analysis. To share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accuracy of the data mining tasks of clustering and classification. An efficient and effective approach is proposed that aims to protect the privacy of sensitive information while obtaining a data clustering with minimal information loss.
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach (IJECEIAES)
Defining the correct number of clusters is one of the most fundamental tasks in graph clustering. For large graphs, this task becomes even more challenging because of the lack of prior information. This paper presents an approach to the problem based on the Bat Algorithm, one of the most promising swarm intelligence algorithms; we call our solution "Bat-Cluster (BC)." The approach automates graph clustering through a balance between global and local search processes. Simulations on four benchmark graphs of different sizes show that the proposed algorithm is efficient, providing higher precision and exceeding some best-known values.
Survey on Efficient Techniques of Text Mining (vivatechijri)
In the current era, with the advancement of technology, more and more data is available in digital form, and most of it (approximately 85%) is unstructured text. It has therefore become essential to develop better techniques and algorithms to extract useful and interesting information from this large amount of textual data. Text mining is the process of extracting useful information from unstructured text. Each algorithm used for text mining has its advantages and disadvantages. The survey also identifies the issues in the field of text mining that affect the accuracy and relevance of the results.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensembles (IJECEIAES)
Data analysis plays a prominent role in interpreting various phenomena, and data mining is the process of extracting useful knowledge from extensive data. Going beyond classical statistical models, the data can be exploited beyond its mere storage and management. Cluster analysis, a primary investigation conducted with little or no prior knowledge, spans research and development across a wide variety of communities. Cluster ensembles combine individual solutions obtained from different clusterings into a final high-quality clustering, which is required in many applications; the method arose from the need for greater robustness, scalability, and accuracy. This paper gives a brief overview of the generation methods and consensus functions used in cluster ensembles and surveys the various techniques and cluster ensemble methods.
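One widely used consensus function for cluster ensembles, the co-association (evidence accumulation) matrix, can be sketched as follows. This is a minimal illustration with hypothetical base labelings, not a method taken from the surveyed paper itself.

```python
def co_association(labelings):
    """Co-association matrix: entry (i, j) is the fraction of base
    clusterings that place points i and j in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    co = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    return [[c / m for c in row] for row in co]

# Three base clusterings of five points (cluster ids are arbitrary per run).
runs = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
co = co_association(runs)
print(co[0][1], co[0][3])  # 1.0 0.0
```

The final consensus clustering is then typically obtained by running any similarity-based clusterer (for example, hierarchical clustering) on this matrix; because only co-membership is counted, the arbitrary cluster ids of each base run never need to be aligned.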
Data Analytics on Solar Energy Using Hadoop (IJMERJOURNAL)
ABSTRACT: Missing data is one of the major issues in data mining and pattern recognition. The knowledge contained in attributes with missing values is important for improving an organization's regression-correlation process, and learning from each instance is necessary, as it may contain some exceptional knowledge. There are various methods to handle missing data in regression correlation. The work analyzes photovoltaic cells and the sunlight striking different geographical locations to identify defective or disconnected photovoltaic plates. We mainly aim to showcase the energy produced at different geographical locations, to find defective plates, and to analyze the datasets of energy produced together with the current weather in a particular area in order to determine the status of the photovoltaic plates. In this project, we used the Hadoop MapReduce framework to analyze the solar energy datasets.
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA (cscpconf)
In data mining applications, which often involve complex data such as multiple heterogeneous data sources, user preferences, decision-making actions, and business impacts, the complete useful information cannot be obtained as informative patterns by a single data mining method, since that would consume more time and space; it is only feasible when large relevant data sources can be joined to discover patterns covering the various aspects of useful information. We consider combined mining as an approach for mining informative patterns from multiple data sources, multiple features, or by multiple methods, as required. In the combined mining approach, we apply the Lossy Counting algorithm to each data source to obtain frequent itemsets and then derive the combined association rules. In the multi-feature combined mining approach, we obtain pair patterns and cluster patterns and then generate incremental pair patterns and incremental cluster patterns, which cannot be directly generated by existing methods. In the multi-method combined mining approach, we combine FP-growth and a Bayesian belief network to build a classifier that yields more informative knowledge.
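The per-source frequency step mentioned above can be sketched with a minimal single-item version of the Lossy Counting algorithm. The bucket-width choice, the stream, and the function name are illustrative; the paper's itemset-level variant is not reproduced here.

```python
def lossy_count(stream, epsilon=0.1):
    """Lossy Counting sketch: approximate frequency counts over a stream
    using bounded memory; true counts are underestimated by at most
    epsilon * N, where N is the number of items seen."""
    width = int(1 / epsilon)           # bucket width
    counts, deltas = {}, {}            # running count and max undercount
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1  # items missed before first sighting
        if n % width == 0:             # end of bucket: prune rare items
            for k in list(counts):
                if counts[k] + deltas[k] <= bucket:
                    del counts[k], deltas[k]
            bucket += 1
    return counts

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"] * 5
print(lossy_count(stream))  # → {'a': 25, 'b': 15}
```

The rare items "c" and "d" are pruned at each bucket boundary, while the genuinely frequent items keep exact counts, which is why the technique suits mining frequent itemsets from large data sources in one pass.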
Research Inventy: International Journal of Engineering and Science is published by a group of young academic and industrial researchers, with 12 issues per year. It is an open-access journal, available online and in print, that provides rapid monthly publication of articles in all areas of the subject, such as civil, mechanical, chemical, electronic, and computer engineering, as well as production and information technology. The journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers are published by a rapid process within 20 days of acceptance, and the peer-review process takes only 7 days. All articles published in Research Inventy are peer-reviewed.
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success. Three main methods are used to discover patterns in data: KDD, SEMMA, and CRISP-DM. They are presented in many publications in the area and are used in practice. To our knowledge, no clear methodology has been developed to support link mining. However, there is a well-known methodology in knowledge discovery in databases, the Cross Industry Standard Process for Data Mining (CRISP-DM), developed by a consortium of several industrial companies, which can be relevant to the study of link mining. In this study, CRISP-DM has been adapted to the field of link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. The approach is implemented through a case study on real-world co-citation data. The case study uses mutual information to interpret the semantics of anomalies identified in the co-citation dataset, which can provide valuable insights for determining the nature of a given link and potentially identifying important future link relationships.
A Survey on Constellation Based Attribute Selection Method for High Dimensional Data (IJERA Editor)
Attribute selection is an important topic in data mining because it is an effective way to reduce dimensionality, remove irrelevant and redundant data, and increase the accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces results compatible with the original entire set of attributes. Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group, called a cluster, are more similar in some sense to each other than to those in other groups (clusters). There are various approaches and techniques for attribute subset selection, namely the wrapper approach, the filter approach, the Relief algorithm, distributional clustering, and so on, but each has disadvantages, such as an inability to handle large volumes of data, computational complexity, unguaranteed accuracy, difficulty of evaluation, and redundancy detection. To address some of these issues, this paper proposes a technique that aims to design an effective clustering-based attribute selection method for high dimensional data. Initially, attributes are divided into clusters using a graph-based clustering method, the minimum spanning tree (MST). In the second step, the most representative attribute, the one most strongly related to the target classes, is selected from each cluster to form a subset of attributes. The purpose is to increase accuracy, reduce dimensionality, shorten training time, and improve generalization by reducing overfitting.
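The first step described above, partitioning attributes via a minimum-spanning-tree clustering, can be sketched as follows. This is a rough illustration using a correlation-based dissimilarity and Prim's algorithm; the threshold, the toy data, and all names are assumptions rather than the paper's exact procedure.

```python
import math

def distance(x, y):
    """1 - |Pearson correlation| as an attribute dissimilarity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - abs(cov / (sx * sy))

def mst_clusters(attrs, threshold=0.5):
    """Prim's MST over attributes, then cut edges longer than threshold;
    the remaining connected components are the attribute clusters."""
    n = len(attrs)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n)
                    if j not in in_tree),
                   key=lambda e: distance(attrs[e[0]], attrs[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    parent = list(range(n))          # union-find over kept (short) edges
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for i, j in edges:
        if distance(attrs[i], attrs[j]) <= threshold:
            parent[find(j)] = find(i)
    groups = {}
    for a in range(n):
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())

# Four attributes over five samples: 0 and 1 correlated, 2 and 3 correlated.
attrs = [[1, 2, 3, 4, 5],
         [2, 4, 6, 8, 10],
         [5, 1, 4, 2, 3],
         [10, 2, 8, 4, 6]]
print(mst_clusters(attrs))  # → [[0, 1], [2, 3]]
```

A representative attribute would then be picked from each resulting group, which is the second step the abstract describes.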
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS (Editor IJCATR)
This paper presents a hybrid data mining approach based on supervised and unsupervised learning to identify the closest data patterns in the database. The technique achieves a maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithms and is also evaluated on multidimensional datasets. The implementation results show better prediction accuracy and reliability.
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata represents information about the data stored in a data warehouse and is a mandatory element in building an efficient data warehouse. Metadata helps in data integration, lineage, data quality, and populating transformed data into the data warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time while accurate information, including spatial information, is fetched from the data warehouse.
With the development of databases, the volume of data stored in them increases rapidly, and much important information is hidden within these large amounts of data. If this information can be extracted from the database, it will create a lot of profit for the organization. The question organizations are asking is how to extract this value; the answer is data mining. There are many technologies available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
Ijricit 01-002: Enhanced Replica Detection in Short Time for Large Data Sets (Ijripublishers Ijri)
Similarity checking of real-world entities, known as data replica detection, is a necessary task these days. Time is a critical factor in data replica detection for large data sets, without compromising the quality of the dataset. We introduce two data replica detection algorithms that contribute enhanced procedural standards for finding data replication within limited execution periods and improve on the running time of conventional techniques: the progressive sorted neighborhood method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets.
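The progressive sorted neighborhood idea can be sketched as follows: records are sorted by a key, and pairs are compared at rank distance 1 first, then 2, and so on, so that the most promising pairs are checked earliest. This is a simplified illustration; the similarity rule, data, and names are assumptions, not the algorithms' published details.

```python
def psnm(records, max_window=3):
    """Progressive sorted neighborhood sketch: sort records by key, then
    compare neighbors at rank distance 1, 2, ..., reporting likely
    duplicates as early as possible."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    pairs = []
    for dist in range(1, max_window):        # widen the window stepwise
        for k in range(len(order) - dist):
            i, j = order[k], order[k + dist]
            if similar(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

def similar(a, b):
    # Toy similarity: equal after lowercasing and stripping spaces.
    norm = lambda s: s.lower().replace(" ", "")
    return norm(a) == norm(b)

recs = ["John Smith", "johnsmith", "Jane Doe", "J. Smith", "jane doe"]
print(psnm(recs))  # → [(2, 4), (0, 1)]
```

The progressive part is the outer loop over the window distance: if execution is cut short, the pairs found so far are the ones most likely to be true duplicates, which matches the "limited execution periods" goal in the abstract.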
Data mining is used to manage the huge amounts of information stored in data warehouses and databases, in order to discover the required information. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the field has been a focus of attention for many years. Among the available data mining strategies, clustering the dataset is one of the most effective: it groups the dataset into a number of clusters based on certain predefined guidelines and can reliably discover the relationships between the distinct characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed to cluster the data points. In this work, the author enhances the Euclidean distance formula to increase cluster quality.
The problem of accuracy and redundancy of dissimilar points in the clusters remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
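The similarity check described above, admitting a point into a cluster only if it is close enough to the centroid, can be sketched as follows. This is a minimal one-pass illustration under an assumed distance threshold, not the authors' exact similarity function.

```python
import math

def assign_with_similarity(points, centroids, threshold=2.0):
    """Assign each point to its nearest centroid, but only if the
    Euclidean distance is below the similarity threshold; otherwise
    flag the point as an outlier instead of forcing it into a cluster."""
    clusters = {i: [] for i in range(len(centroids))}
    outliers = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        best = min(range(len(centroids)), key=lambda i: dists[i])
        if dists[best] <= threshold:
            clusters[best].append(p)
        else:
            outliers.append(p)
    return clusters, outliers

pts = [(0, 0), (1, 1), (10, 10), (11, 10), (5, 5)]
cents = [(0.5, 0.5), (10.5, 10)]
clusters, outliers = assign_with_similarity(pts, cents)
print(clusters)   # (5, 5) is too far from both centroids
print(outliers)
```

Refusing to force dissimilar points into the nearest cluster is what keeps such outliers from dragging centroids away and degrading cluster quality, which is the failure mode the abstract attributes to the plain improved k-means.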
In the present day, a huge amount of data is generated every minute and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, and newly generated data is constantly added to the existing data. To discover knowledge from this incremental data, one approach is to rerun the algorithm on the modified data sets each time, which is time-consuming. Proper analysis of the datasets also requires the construction of an efficient classifier model, whose objective is to classify unlabeled data into appropriate classes. The paper proposes a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced attribute set, a dynamic reduct, together with an optimization algorithm that uses the reduct to build the corresponding classification system. The method analyzes each new dataset when it becomes available and modifies the reduct accordingly to fit the entire dataset, from which interesting optimal classification rule sets are generated. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated to generate the dynamic reduct set, and optimal classification rules are selected using the PSO method, which not only reduces complexity but also helps achieve higher accuracy of the decision system. The proposed method has been applied to benchmark datasets collected from the UCI repository; dynamic reducts are computed, and optimal classification rules are generated from them. Experimental results show the efficiency of the proposed method.
Document Classification Using Expectation Maximization with Semi Supervised Learning (ijsc)
As the number of online documents increases, the demand for document classification to aid the analysis and management of documents is increasing. Text is cheap, but information, in the form of knowing which classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation-maximization technique of data mining for classifying documents and to show how accuracy improves with a semi-supervised approach. The expectation-maximization algorithm is applied with both supervised and semi-supervised approaches, and the semi-supervised approach is found to be more accurate and effective. Its main advantage is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the unlabeled documents. The car dataset used for evaluation is collected from the UCI repository, with some changes made on our side.
Particle Swarm Optimization based K-Prototype Clustering Algorithm (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind, peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes the publication of high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATAcscpconf
In Data mining applications, which often involve complex data like multiple heterogeneous data sources, user preferences, decision-making actions and business impacts etc., the complete useful information cannot be obtained by using single data mining method in the form of informative patterns as that would consume more time and space, if and only if it is possible to join large relevant data sources for discovering patterns consisting of various aspects of useful information. We consider combined mining as an approach for mining informative patterns
from multiple data-sources or multiple-features or by multiple-methods as per the requirements. In combined mining approach, we applied Lossy-counting algorithm on each data-source to get the frequent data item-sets and then get the combined association rules. In multi-feature combined mining approach, we obtained pair patterns and cluster patterns and then generate incremental pair patterns and incremental cluster patterns, which cannot be directly generated by the existing methods. In multi-method combined mining approach, we combine FP-growth and Bayesian Belief Network to make a classifier to get more informative knowledge.
Combined mining approach to generate patterns for complex datacsandit
In Data mining applications, which often involve complex data like multiple heterogeneous data
sources, user preferences, decision-making actions and business impacts etc., the complete
useful information cannot be obtained by using single data mining method in the form of
informative patterns as that would consume more time and space, if and only if it is possible to
join large relevant data sources for discovering patterns consisting of various aspects of useful
information. We consider combined mining as an approach for mining informative patterns
from multiple data-sources or multiple-features or by multiple-methods as per the requirements.
In combined mining approach, we applied Lossy-counting algorithm on each data-source to get
the frequent data item-sets and then get the combined association rules. In multi-feature
combined mining approach, we obtained pair patterns and cluster patterns and then generate
incremental pair patterns and incremental cluster patterns, which cannot be directly generated
by the existing methods. In multi-method combined mining approach, we combine FP-growth
and Bayesian Belief Network to make a classifier to get more informative knowledge.
Research Inventy : International Journal of Engineering and Science is published by the group of young academic and industrial researchers with 12 Issues per year. It is an online as well as print version open access journal that provides rapid publication (monthly) of articles in all areas of the subject such as: civil, mechanical, chemical, electronic and computer engineering as well as production and information technology. The Journal welcomes the submission of manuscripts that meet the general criteria of significance and scientific excellence. Papers will be published by rapid process within 20 days after acceptance and peer review process takes only 7 days. All articles published in Research Inventy will be peer-reviewed.
Many data mining and knowledge discovery methodologies and process models have been developed, with varying degrees of success, there are three main methods used to discover patterns in data; KDD, SEMMA and CRISP-DM. They are presented in many of the publications of the area and are used in practice. To our knowledge, there is no clear methodology developed to support link mining. However, there is a well known methodology in knowledge discovery in databases, known as Cross Industry Standard Process for Data Mining (CRISPDM), developed by a consortium of several industrial companies which can be relevant to the study of link mining. In this study CRISP-DM has been adapted to the field of Link mining to detect anomalies. An important goal in link mining is the task of inferring links that are not yet known in a given network. This approach is implemented through the use of a case study of real world data (co-citation data). This case study aims to use mutual information to interpret the semantics of anomalies identified in co-citation, dataset that can provide valuable insights in determining the nature of a given link and potentially identifying important future link relationships
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
Attribute Selection is an important topic in Data Mining, because it is the effective way for reducing dimensionality, removing irrelevant data, removing redundant data, & increasing accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces compatible results as the original entire set of attribute. Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group called a cluster are more similar in some sense or another to each other than to those in other groups (Clusters). There are various approaches & techniques for attribute subset selection namely Wrapper approach, Filter Approach, Relief Algorithm, Distributional clustering etc. But each of one having some disadvantages like unable to handle large volumes of data, computational complexity, accuracy is not guaranteed, difficult to evaluate and redundancy detection etc. To get the upper hand on some of these issues in attribute selection this paper proposes a technique that aims to design an effective clustering based attribute selection method for high dimensional data. Initially, attributes are divided into clusters by using graph-based clustering method like minimum spanning tree (MST). In the second step, the most representative attribute that is strongly related to target classes is selected from each cluster to form a subset of attributes. The purpose is to increase the level of accuracy, reduce dimensionality; shorter training time and improves generalization by reducing over fitting.
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
This paper presents a hybrid data mining approach based on supervised learning and unsupervised learning to identify the closest data patterns in the data base. This technique enables to achieve the maximum accuracy rate with minimal complexity. The proposed algorithm is compared with traditional clustering and classification algorithm and it is also implemented with multidimensional datasets. The implementation results show better prediction accuracy and reliability.
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
Metadata represents the information about data to be stored in Data Warehouses. It is a mandatory
element of Data Warehouse to build an efficient Data Warehouse. Metadata helps in data integration,
lineage, data quality and populating transformed data into data warehouse. Spatial data warehouses are
based on spatial data mostly collected from Geographical Information Systems (GIS) and the transactional
systems that are specific to an application or enterprise. Metadata design and deployment is the most
critical phase in building of data warehouse where it is mandatory to bring the spatial information and
data modeling together. In this paper, we present a holistic metadata framework that drives metadata
creation for spatial data warehouse. Theoretically, the proposed metadata framework improves the
efficiency of accessing of data in response to frequent queries on SDWs. In other words, the proposed
framework decreases the response time of the query and accurate information is fetched from Data
Warehouse including the spatial information
With the development of database, the data volume stored in database increases rapidly and in the large
amounts of data much important information is hidden. If the information can be extracted from the
database they will create a lot of profit for the organization. The question they are asking is how to extract
this value. The answer is data mining. There are many technologies available to data mining practitioners,
including Artificial Neural Networks, Genetics, Fuzzy logic and Decision Trees. Many practitioners are
wary of Neural Networks due to their black box nature, even though they have proven themselves in many
situations. This paper is an overview of artificial neural networks and questions their position as a
preferred tool by data mining practitioners.
Ijricit 01-002 enhanced replica detection in short time for large data sets Ijripublishers Ijri
Similarity checking of real-world entities, known as data replica detection, is a necessary task these days.
Execution time is a critical factor in replica detection over large data sets and must be reduced without
impacting the quality of the dataset. We introduce two data replica detection algorithms that deliver
enhanced procedural standards for finding replicas within limited execution periods, improving on the
running time of conventional techniques. The two algorithms are the progressive sorted neighborhood
method (PSNM), which performs best on small and almost clean datasets, and progressive blocking (PB),
which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection
even on very large datasets.
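The progressive idea behind PSNM can be sketched as follows. This is a minimal, single-key sketch, not the paper's implementation: records are sorted once, then neighbours are compared at increasing rank distance, so the most promising pairs are reported first. The similarity measure, threshold and window size are illustrative choices.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # String similarity in [0, 1]; near-duplicates score close to 1.
    return SequenceMatcher(None, a, b).ratio()

def psnm(records, key=lambda r: r, max_window=4, threshold=0.9):
    """Progressive Sorted Neighborhood sketch: sort once, then compare
    neighbours at rank distance 1, 2, ... so the most likely duplicates
    are found early, long before the full comparison budget is spent."""
    ordered = sorted(records, key=key)
    duplicates = []
    for dist in range(1, max_window):            # progressively widen the window
        for i in range(len(ordered) - dist):
            a, b = ordered[i], ordered[i + dist]
            if similarity(key(a), key(b)) >= threshold:
                duplicates.append((a, b))
    return duplicates

people = ["john smith", "jon smith", "mary jones", "marry jones", "alan poe"]
print(psnm(people))   # the two near-duplicate name pairs
```

Because the closest sort ranks are compared first, the procedure can be stopped at any time and still have reported the most likely duplicates, which is the "progressive" property the abstract refers to.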
Data mining is used to manage the huge amounts of information stored in data warehouses and databases in order to discover the required knowledge. Many data mining techniques have been proposed, for example association rules, decision trees, neural networks and clustering, and the field has been a point of attention for many years. A well-known technique among the available data mining strategies is clustering of the dataset, one of the most effective data mining methods. It groups the dataset into a number of clusters based on certain predefined guidelines and is relied upon to discover the relationships between the different characteristics of the data.
In the k-means clustering algorithm, the feature is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed for clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant, dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach has been proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
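The similarity-check idea can be illustrated with a minimal assignment step. This is a sketch under assumptions, not the paper's enhanced formula: plain Euclidean distance is used, and the "similarity level" is approximated by a distance threshold (max_dist) that keeps dissimilar points out of the clusters.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_with_similarity(points, centroids, max_dist=2.0):
    """One assignment step: each point goes to its nearest centroid,
    but only if it passes the similarity check (distance below
    max_dist); otherwise it is left out, so dissimilar points never
    pollute a cluster."""
    clusters = {i: [] for i in range(len(centroids))}
    outliers = []
    for p in points:
        dists = [euclidean(p, c) for c in centroids]
        best = min(range(len(centroids)), key=lambda i: dists[i])
        if dists[best] <= max_dist:
            clusters[best].append(p)
        else:
            outliers.append(p)
    return clusters, outliers

pts = [(0, 0), (0.5, 0.5), (5, 5), (5.5, 5), (20, 20)]
clusters, outliers = assign_with_similarity(pts, [(0, 0), (5, 5)])
print(clusters, outliers)   # (20, 20) fails the similarity check
```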
In the present day, a huge amount of data is generated every minute and transferred frequently. Although
the data is sometimes static, most commonly it is dynamic and transactional, with newly generated data
constantly being added to the old/existing data. To discover knowledge from this incremental data, one
approach is to run the algorithm repeatedly on the modified data sets, which is time consuming. Moreover,
to analyze the datasets properly, construction of an efficient classifier model is necessary; the objective
of developing such a classifier is to classify unlabeled data into appropriate classes. The paper proposes
a dimension reduction algorithm that can be applied in a dynamic environment to generate a reduced
attribute set as a dynamic reduct, and an optimization algorithm that uses the reduct to build the
corresponding classification system. The method analyzes the new dataset when it becomes available,
modifies the reduct to fit the entire dataset, and generates interesting optimal classification rule sets
from the entire data set. The concepts of discernibility relation, attribute dependency and attribute
significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, and
optimal classification rules are selected using the PSO method, which not only reduces complexity but
also helps achieve higher accuracy of the decision system. The proposed method has been applied to some
benchmark datasets collected from the UCI repository; the dynamic reduct is computed, and optimal
classification rules are generated from it. Experimental results show the efficiency of the proposed
method.
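The Rough Set notion of attribute dependency used above can be made concrete with a small sketch. The helper names and the toy decision table are illustrative; the paper's dynamic-reduct and PSO machinery is not reproduced here.

```python
from collections import defaultdict

def partition(universe, attrs):
    """Equivalence classes of objects that agree on all attributes in attrs."""
    blocks = defaultdict(list)
    for i, obj in enumerate(universe):
        blocks[tuple(obj[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(universe, cond_attrs, dec_attr):
    """Rough-set attribute dependency gamma(C, D): the fraction of objects
    whose condition-attribute class fits entirely inside one decision
    class (i.e. the size of the positive region over the universe)."""
    dec_classes = [set(b) for b in partition(universe, [dec_attr])]
    positive = 0
    for block in partition(universe, cond_attrs):
        if any(set(block) <= d for d in dec_classes):
            positive += len(block)
    return positive / len(universe)

# Toy decision table: condition attributes a, b and decision d.
table = [
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 1, "d": "yes"},
    {"a": 0, "b": 0, "d": "yes"},   # conflicts with the first row
]
print(dependency(table, ["a", "b"], "d"))   # 0.5: two of four objects are decidable
```

A reduct is then a minimal subset of condition attributes whose dependency equals that of the full attribute set; a dynamic reduct re-checks this property as new rows arrive.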
Document Classification Using Expectation Maximization with Semi Supervised L...ijsc
As the number of online documents increases, the demand for document classification to aid the analysis and management of documents is growing. Text is cheap, but information, in the form of knowing which classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation maximization technique of data mining for classifying documents and to show how accuracy improves with a semi-supervised approach. The expectation maximization algorithm is applied in both a supervised and a semi-supervised setting, and the semi-supervised approach is found to be more accurate and effective. Its main advantage is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the
unlabeled documents. The car dataset used for evaluation is collected from the UCI repository, with some changes made on our side.
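The train-then-probabilistically-classify loop can be sketched in miniature, assuming a multinomial naive Bayes base classifier (a common pairing for EM on documents, not necessarily the paper's exact setup). All document and vocabulary names below are illustrative.

```python
import math

def train_nb(weighted_docs, classes, vocab):
    """Naive Bayes from (words, {class: weight}) pairs. Weights are
    fractional class memberships, so the same routine serves both the
    supervised step and the EM re-training step."""
    prior = {c: 1.0 for c in classes}                    # add-one style smoothing
    word = {c: {w: 1.0 for w in vocab} for c in classes}
    for words, memb in weighted_docs:
        for c, g in memb.items():
            prior[c] += g
            for w in words:
                word[c][w] += g
    for c in classes:
        z = sum(word[c].values())
        word[c] = {w: n / z for w, n in word[c].items()}
    zp = sum(prior.values())
    return {c: prior[c] / zp for c in classes}, word

def posterior(words, prior, word, classes):
    """P(class | words), computed in log space for stability."""
    logp = {c: math.log(prior[c]) + sum(math.log(word[c].get(w, 1e-9)) for w in words)
            for c in classes}
    m = max(logp.values())
    exp = {c: math.exp(l - m) for c, l in logp.items()}
    z = sum(exp.values())
    return {c: e / z for c, e in exp.items()}

def em_semisupervised(labeled, unlabeled, classes, vocab, iters=5):
    """Train on labeled docs, then alternate: (E) classify unlabeled docs
    probabilistically, (M) retrain on labeled + softly-labeled docs."""
    data = [(d, {c: 1.0}) for d, c in labeled]
    prior, word = train_nb(data, classes, vocab)
    for _ in range(iters):
        soft = [(d, posterior(d, prior, word, classes)) for d in unlabeled]
        prior, word = train_nb(data + soft, classes, vocab)
    return prior, word

labeled = [(["ball", "goal"], "sport"), (["vote", "law"], "politics")]
unlabeled = [["goal", "match"], ["law", "court"]]
vocab = {"ball", "goal", "vote", "law", "match", "court"}
prior, word = em_semisupervised(labeled, unlabeled, ["sport", "politics"], vocab)
p = posterior(["match", "goal"], prior, word, ["sport", "politics"])
```

Note that "match" never appears in a labeled document; the unlabeled data is what teaches the model to associate it with the sport class.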
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
A Novel Approach for Clustering Big Data based on MapReduce IJECEIAES
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing and social network analytics, and it helps the user understand the similarity and dissimilarity between objects. Cluster analysis lets users understand complex and large data sets more clearly. Different types of clustering algorithms have been analyzed by various researchers. K-means is the most popular partitioning-based algorithm, as it provides good results thanks to accurate calculation on numerical data; however, K-means gives good results for numerical data only, while big data is a combination of numerical and categorical data. The K-prototype algorithm deals with numerical as well as categorical data by combining the distances calculated from the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculation and so on, there are vast collections of structured, semi-structured and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments have shown that K-prototype implemented on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics for the comparison. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical data. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
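The K-prototype dissimilarity described above — a numeric distance and a categorical mismatch count combined — can be sketched as follows. The attribute layout and the weight gamma are illustrative assumptions.

```python
def kprototype_distance(x, y, num_idx, cat_idx, gamma=1.0):
    """K-prototype dissimilarity: squared Euclidean distance on the
    numeric attributes plus gamma times the number of mismatched
    categorical attributes."""
    num = sum((x[i] - y[i]) ** 2 for i in num_idx)
    cat = sum(1 for i in cat_idx if x[i] != y[i])
    return num + gamma * cat

# Mixed record (age, income, city, plan): indices 0, 1 numeric; 2, 3 categorical.
a = (25, 3.0, "delhi", "gold")
b = (30, 4.0, "delhi", "silver")
print(kprototype_distance(a, b, num_idx=[0, 1], cat_idx=[2, 3], gamma=0.5))
```

The weight gamma controls how strongly categorical mismatches count against numeric closeness; tuning it is part of adapting K-prototype to a given mixed data set.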
Feature selection in high-dimensional datasets is considered to be a complex and time-consuming
problem. To enhance classification accuracy and reduce execution time, Parallel Evolutionary
Algorithms (PEAs) can be used. In this paper, we review the most recent works that use PEAs for
feature selection in large datasets. We classify the algorithms in these papers into four main
classes: Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Scatter Search (SS), and Ant
Colony Optimization (ACO). Accuracy is adopted as the measure to compare the efficiency of these
PEAs. Notably, Parallel Genetic Algorithms (PGAs) are the most suitable algorithms for feature
selection in large datasets, since they achieve the highest accuracy. On the other hand, we found
parallel ACO to be time-consuming and less accurate compared with the other PEAs.
Performance Comparison of Machine Learning Algorithms Dinusha Dilanka
In this paper we compare the performance of two classification algorithms. It is useful to
differentiate algorithms based on computational performance rather than classification accuracy
alone, because even when classification accuracy between the algorithms is similar, computational
performance can differ significantly and can affect the final results. The objective of this paper
is therefore to perform a comparative analysis of two machine learning algorithms, namely K-Nearest
Neighbor classification and Logistic Regression. We consider a large dataset of 7981 data points and
112 features and examine the performance of the two algorithms on it. The processing time and
accuracy of the different machine learning techniques are estimated on the collected data set, using
60% of it for training and the remaining 40% for testing. The paper is organized as follows. Section
I contains the introduction and background analysis of the research, and Section II the problem
statement. Section III briefly describes our application, the data analysis process, the testing
environment, and the methodology of our analysis. Section IV comprises the results of the two
algorithms. Finally, the paper concludes with a discussion of future research directions that would
eliminate the problems in the current research methodology.
Nature Inspired Models And The Semantic WebStefan Ceriu
In this paper we present a series of nature inspired models used as alternative solutions for Semantic Web concerns. Some of the methods presented in this article perform better than classic algorithms by enhancing response time and computational costs. Others are just proof of concept, first steps towards new techniques that will improve their respective field. The intricate nature of the Semantic Web urges the need for faster, more intelligent algorithms and nature inspired models have been proven to be more than suitable for such complex tasks.
Survey on evolutionary computation techniques and its application in dif...ijitjournal
In computer science, 'evolutionary computation' is an algorithmic tool based on evolution. It
implements random variation, reproduction and selection by altering and moving data within a
computer, and it helps in building, applying and studying algorithms based on the Darwinian
principles of natural selection. In this paper, the different evolutionary computation techniques
used in some applications, specifically image processing, cloud computing and grid computing, are
briefly studied. This work is an effort to help researchers from different fields gain knowledge of
the evolutionary computation techniques applicable in the above-mentioned areas.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition data values by similarity, with similarity measures
used to estimate transaction relationships. Hierarchical clustering models produce tree-structured
results, while partitional clustering produces results in a grid format. Text documents are
unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text
documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for
the document grouping process, and clustering accuracy degrades drastically with an unsuitable
cluster count.
Textual data elements are divided into two types: discriminative words and non-discriminative
words. Only discriminative words are useful for grouping documents; the involvement of
non-discriminative words confuses the clustering process and leads to a poor clustering solution.
A variational inference algorithm is used to infer the document collection structure and the
partition of document words at the same time. The Dirichlet Process Mixture (DPM) model is used to
partition documents; it exploits both the data likelihood and the clustering property of the
Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed
without requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with ontology support, and a semantic weight model is used for document
similarity analysis. The system improves scalability with the support of labels and concept
relations for the dimensionality reduction process.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERINGijcsa
Text document clustering is one of the fastest growing research areas because of the availability of a huge amount of information in electronic form. A number of techniques have been introduced for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search for effectively navigating, summarizing, and organizing information. A globally optimal solution can be obtained by applying high-speed, high-quality optimization algorithms, which perform a globalized search over the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is presented.
Extended pso algorithm for improvement problems k means clustering algorithmIJMIT JOURNAL
Clustering is an unsupervised process and one of the most common data mining techniques. The purpose
of clustering is to group similar data together, so that instances within a cluster are as similar
to each other as possible and as different as possible from instances in other clusters. In this
paper we focus on partitional k-means clustering, which, due to its ease of implementation and
high-speed performance on large data sets, is still very popular thirty years after its development.
To address the problem of k-means getting trapped in local optima, we propose an extended PSO
algorithm named ECPSO. The new algorithm is able to escape from local optima and produces the
problem's optimal answer with high probability. The results show that the proposed algorithm
performs better than other clustering algorithms, especially on two indices: the accuracy of
clustering and the quality of clustering.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been extensively studied, and there are many algorithms for different types of clustering. These classical algorithms cannot be applied to big data due to its distinct features, and it is a challenge to apply the traditional techniques to large unstructured data. This study proposes a hybrid model to cluster big data using the famous traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase and a Reduce phase. The first phase uses a map-reduce algorithm to split big data into small datasets; the second phase runs the traditional K-means clustering algorithm on each of the split small data sets; and the last phase is responsible for producing the general cluster output for the complete data set. Two functions, Mode and Fuzzy Gaussian, have been implemented and compared at the last phase to determine the most suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results proved the efficiency of the proposed model in clustering big data using the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
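The three phases can be sketched in miniature. This is a sketch under stated assumptions: one-dimensional data, and plain averaging of the rank-matched per-chunk centroids in the reduce phase, standing in for the paper's Mode and Fuzzy Gaussian combination functions.

```python
import random

def kmeans(points, k, iters=10):
    """Plain Lloyd k-means on one (small) chunk of 1-D data."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            buckets[i].append(p)
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return sorted(centroids)          # sort so centroids align across chunks

def mapper_phase(big_data, chunks):
    """Phase 1: split the big data set into smaller data sets."""
    n = len(big_data) // chunks
    return [big_data[i * n:(i + 1) * n] for i in range(chunks)]

def reduce_phase(local_centroids):
    """Phase 3: combine the per-chunk centroids into global ones.
    Averaging the rank-matched centroids stands in for Mode / Fuzzy Gaussian."""
    k = len(local_centroids[0])
    return [sum(cs[i] for cs in local_centroids) / len(local_centroids)
            for i in range(k)]

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(10, 1) for _ in range(200)]
random.shuffle(data)
chunks = mapper_phase(data, 4)                  # phase 1: map / split
local = [kmeans(c, k=2) for c in chunks]        # phase 2: local clustering
print(reduce_phase(local))                      # phase 3: reduce / combine
```

The appeal of this structure is that phase 2 is embarrassingly parallel: each chunk's k-means run is independent, so it maps directly onto MapReduce workers.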
Application of Parallel Glowworm Swarm Optimization Algorithm for Data Clustering on Large Datasets (2016)
Research Article
INTRODUCTION
Clustering of large unstructured data sets is the main task in analyzing the data. The main
objective of analyzing data is to divide the unstructured data sets into different divisions, each
of which is named a cluster. Each cluster has common specifications shared between the cluster
members. An efficient clustering algorithm is one that yields clusters with very high quality; that
is, the measure of similarity between the data objects within a particular cluster has to be
maximized, while the measure of similarity between data objects from different clusters has to be
minimal. Hence, clustering of large unstructured data sets can be defined as an optimization problem
with multiple objectives. Many algorithms have been introduced to resolve such optimized data
clustering problems, among them glowworm swarm intelligence. Swarm intelligence simulates the
natural behavior of swarms such as flocks of birds, colonies of ants, schools of fish and the growth
of bacteria.

Considering the natural behavior of a swarm, the members of the swarm sense the interaction of each
member in the swarm and share information with each other, such as helping each other to reach food
resources. The participation of all the swarm members is required to achieve the goal, instead of
having a central member in the swarm coordinate all the others. Some examples of swarm intelligence
algorithms are Glowworm Swarm Optimization, Particle Swarm Optimization and Ant Colony Optimization.
Clustering large amounts of data is one of the challenging tasks in many application areas such as
social networking and bioinformatics, and many traditional clustering algorithms need modification
to handle large data sizes. In this work, an algorithm based on Glowworm Swarm Optimization for data
clustering is designed and implemented to handle large unstructured data sets. The proposed
algorithm uses an optimized glowworm swarm to evaluate the clustering problem. Glowworm Swarm
Optimization is used for data clustering because it is very advantageous in resolving multimodal
problems, which in terms of clustering means finding multiple centroids. The algorithm uses the
MapReduce methodology for parallelization, since it provides load balancing in a parallel fashion,
localizes the unstructured data set and is tolerant towards faults. The experimental results show
that the Glowworm Swarm Optimization data clustering algorithm scales better with increasing data
set sizes and achieves a speedup very close to linear while maintaining the clustering quality.

The paper is organized into the following sections: Section II briefly describes the related work
carried out in the domain of analyzing large datasets. Section III describes the problem definition,
the existing system and the proposed system.

1 Reva Institute of Technology and Management, Rukmini Knowledge Park, Kattigenahalli, Yelahanka,
Near Border Security Bustop, Bengaluru, Karnataka-560064, India.
E-mail: hajiramansoor72@gmail.com. *Corresponding author
Application of Parallel Glowworm Swarm Optimization Algorithm for Data Clustering on Large Datasets
Hajira Tabasum M,1* Akram Pasha1
Abstract: Data analysis is the primary task with unstructured large data sets, which is a major concern in many application areas.
Many data analysis algorithms need to be modified in order to handle large data sizes efficiently. In this work, a Glowworm Swarm
Optimization data clustering algorithm is designed and implemented to handle unstructured large data sets. The algorithm uses an
optimized glowworm swarm to evaluate the data clustering problem, and is chosen because it is very advantageous in resolving
problems with multimodality, which in terms of clustering means finding multiple centroids. The Glowworm Swarm Optimization data
clustering algorithm uses the MapReduce methodology for parallelization, since it provides load balancing in a parallel fashion,
localizes the data and is tolerant towards faults. The experiments conducted on various data sets show scalability to large
unstructured data sets and better data clustering performance, proving that the Glowworm Swarm Optimization data clustering
algorithm is efficient compared to traditional algorithms for clustering large unstructured data sets.
Asian Journal of Engineering and Technology Innovation
Volume 4, Issue 7
Published on: 7/05/2016
Cite this article as: Hajira Tabasum M, Akram Pasha. Application of Parallel Glowworm Swarm Optimization Algorithm for Data
Clustering on Large Datasets. Asian Journal of Engineering and Technology Innovation, Vol 4(7): 34-39, 2016.
The proposed system is explained using the glowworm clustering algorithm process and the MapReduce
programming framework. Section IV explains the experiments conducted on various data sets and the
results obtained. The proposed work is concluded in Section V.
LITERATURE SURVEY
This section reviews the related work carried out in the domain
of analyzing the large datasets, considering the data clustering
to be a prime problem.
The work proposed by R. Eberhart and J. Kennedy in [1]
introduced a new method for optimizing continuous nonlinear
functions. The method was introduced through the simulation of
a very simple social model; the social metaphor is thus
discussed, although the algorithm stands without metaphorical
support. The work describes the particle swarm optimization
concept in terms of its precursors, taking into consideration
the stages of its development from social simulation to
optimizer. Particle swarm optimization has roots in two main
component methodologies. Perhaps more obvious are its ties to
artificial life in general, and to fish schooling, bird
flocking and swarming theory in particular. It is also related
to evolutionary computation, with ties to both evolutionary
programming and genetic algorithms. These relationships are
briefly described in the paper.
The particle swarm methodology was introduced for the
optimization of nonlinear functions. Several paradigms that
evolved from it are outlined, and the implementation of one
paradigm is discussed together with its benchmark testing.
Applications, including the training of neural networks and
the optimization of nonlinear functions, are described and
proposed. The relationships between particle swarm
optimization and both genetic algorithms and artificial life
are also described.
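The canonical particle swarm update described in [1] — each particle pulled toward its own best position (cognitive term) and the swarm's best (social term) — can be sketched as follows. The coefficients, the search range and the sphere benchmark are illustrative choices, not taken from the paper.

```python
import random

def pso(f, dim, n_particles=20, iters=60, w=0.7, c1=1.5, c2=1.5):
    """Minimize f over R^dim with canonical particle swarm optimization:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x); x <- x + v."""
    random.seed(1)
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # cognitive pull
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # social pull
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Minimize the sphere function, a standard nonlinear benchmark.
best, best_val = pso(lambda x: sum(v * v for v in x), dim=3)
print(best_val)
```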
Figure 1: Architecture of the proposed system (glowworm initialization, centroid search map task,
centroid selection, clustering movement, centroid confirm reduce task, clustered data set)
Figure 2: Execution of the glowworm clustering algorithm
In the work proposed by Z. Weizhong, M. Huifang, and H.
Qing in [2], a k-means parallel clustering algorithm using a
programming model Map Reduce was introduced, which is a
simple yet powerful parallel programming technique. The
results through different experiments demonstrate that the
algorithm proposed can efficiently process large data sets and
scale well on commodity hardware. Clustering of data has been
received considerable attention in many applications, such as
document retrieval, data mining,pattern classification and
image segmentation. The large volumes of information evolved
by the progress of technology, makes clustering of very large
scale of data a challenging task. Many renowned researchers
tried their best to design very efficient parallel clustering
algorithms In order to deal with these data clustering problems.
The Map and Reduce functions are part of a programming
model on the Hadoop platform that is effectively used for
processing and generating the large data sets to be clustered in
real-world tasks. The computation is specified in terms of the
map and reduce functions, and the underlying runtime system
automatically parallelizes the computation across large
clusters of machines.
In that work, the k-means algorithm is adapted to the
MapReduce framework, implemented on Hadoop, to make the
data clustering method applicable to large-scale data sets. By
applying proper <key, value> pairs, the proposed algorithm
can be executed effectively in parallel. Comprehensive
experiments were conducted to evaluate the proposed
algorithm, and the experimental results demonstrate that it can
effectively deal with large-scale unlabeled data sets.
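As a concrete illustration of how such a clustering step maps onto the <key, value> model, the single-machine Python sketch below expresses one k-means iteration as a map phase (emit the nearest-centroid index as the key for each point) and a reduce phase (average the points grouped under each key). The function and variable names are illustrative, not taken from the cited implementation.

```python
from collections import defaultdict
import math

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, point) as a <key, value> pair."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists)), point

def kmeans_reduce(key, points):
    """Reduce: average all points that arrived under one centroid key."""
    n = len(points)
    return key, tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centroids):
    """One iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        k, v = kmeans_map(p, centroids)
        groups[k].append(v)
    return [kmeans_reduce(k, pts)[1] for k, pts in sorted(groups.items())]

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
# new_centroids -> [(0.0, 0.5), (9.5, 9.5)]
```

On a real Hadoop cluster the grouping step is performed by the framework's shuffle phase rather than an in-memory dictionary.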
In the work proposed by I. Aljarah and S. Ludwig in [3], the
glowworm swarm optimization algorithm was introduced as a
nature-inspired optimization algorithm simulating the natural
behavior of lighting worms. The glowworm algorithm has
many applications in problems requiring multiple solutions in
a huge search space, with equal or different objective function
values. Therefore, in this work, a data clustering algorithm
based on glowworms is proposed, in which the glowworm
algorithm is modified to solve the data clustering problem and
to effectively locate multiple optimal centroids using the
multimodal search capability of the glowworm algorithm.
Figure 3: Output of glowworm clustering algorithm
Figure 4: Performance comparison of DBSCAN clustering and glowworm clustering
The optimized glowworm algorithm for data clustering
ensures that the similarity between members within a cluster
is maximal and the similarity among members from different
clusters is minimal.
Fitness functions are proposed to evaluate the goodness of
the members of the glowworm swarm in achieving better-
quality data clusters. The proposed algorithm is tested on real-
world and artificial large unstructured data sets, and its better
performance over popular clustering algorithms is
demonstrated on many data sets. The results show that
glowworm swarm optimization for data clustering can
efficiently be used for clustering large data sets.
This work further extends glowworm swarm optimization
to solve the clustering problem, taking advantage of the
algorithm's multimodal search capability to locate and
optimize centroids. Consequently, the proposed algorithm is
capable of identifying the number of clusters to be generated
without the number being specified explicitly in advance.
Moreover, three different fitness functions are introduced to
add robustness and flexibility to the proposed algorithm.
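The three fitness functions are not reproduced in closed form here. As one plausible example of a cluster-quality fitness of the kind described, the sketch below scores a candidate set of centroids by the sum of squared distances from each point to its assigned centroid; lower values mean tighter, higher-quality clusters. All names are illustrative.

```python
import math

def sse_fitness(data, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid.
    Lower values indicate more compact clusters."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(data, assignments))

data = [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0)]
centroids = [(0.0, 1.0), (4.0, 4.0)]
# Assigning the first two points to centroid 0 and the last to centroid 1
# gives 1 + 1 + 0 = 2.0; any worse assignment scores strictly higher.
fitness = sse_fitness(data, centroids, [0, 0, 1])
```

A swarm would then prefer candidate centroids whose fitness is lower, maximizing within-cluster similarity as described above.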
In the work proposed by He, H. Tan, Luo, S. Feng, and J.
Fan in [4], the authors address DBSCAN (density-based
spatial clustering of applications with noise), a very important
spatial data clustering method that is widely adopted in a
number of applications. Because data sets are extremely large
these days, processing such complex data sets with a
sequential density-based algorithm becomes infeasible.
However, existing approaches to parallelizing density-based
spatial algorithms have three major disadvantages. First, they
are not effective at properly balancing the load among parallel
tasks, specifically when the data are heavily skewed. Second,
they are limited in scalability because not all of the critical
sub-procedures are parallelized. Third, they are not designed
primarily for shared-nothing environments, which limits their
portability to emerging parallel processing paradigms.
In this work, MR-DBSCAN is presented as a scalable
density-based spatial clustering algorithm that leverages an
inherently parallel distributed computing framework, the
MapReduce programming model. In this algorithm, all of the
critical sub-procedures are fully parallelized, so there is no
performance bottleneck caused by sequential processing. In
particular, a novel data partitioning technique based on
estimated computation cost is proposed, with the primary
objective of achieving the desired load balance even in the
context of heavily skewed data.
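The cost-based partitioning is only described at a high level here. As an illustrative sketch of the idea (not MR-DBSCAN's actual procedure), the code below greedily assigns grid cells to the currently least-loaded worker, largest estimated cost first, so that heavily skewed cells do not all land on one parallel task. All names are hypothetical.

```python
import heapq

def partition_by_cost(cell_costs, num_workers):
    """Greedily assign each cell to the least-loaded worker so far,
    processing the costliest cells first (a standard load-balancing
    heuristic that approximates equal total cost per worker)."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {}
    for cell, cost in sorted(cell_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[cell] = worker
        heapq.heappush(heap, (load + cost, worker))
    return assignment

# Heavily skewed cell sizes still end up roughly balanced:
costs = {"c1": 90, "c2": 10, "c3": 10, "c4": 10}
assignment = partition_by_cost(costs, 2)
loads = {}
for cell, worker in assignment.items():
    loads[worker] = loads.get(worker, 0) + costs[cell]
# one worker receives c1 (cost 90), the other receives c2 + c3 + c4 (cost 30)
```

In a real system the per-cell cost would be estimated from sampled point counts rather than known exactly.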
PROBLEM DEFINITION
This section describes the problem of clustering large
unstructured data. Clustering large unstructured data is one of
the primary tasks in many application areas, such as the
analysis of banking transaction data, the analysis of data from
social networking sites, and many more.
Clustering unstructured data is an important data analysis
task whose main objective is to divide a set of unstructured
data objects into different groups, named clusters, such that
the members of each cluster share some specifications in
common. Clustering very large data sets that contain large
numbers of unlabeled records with very high dimensionality is
difficult and computationally expensive.
EXISTING SYSTEM
Parallelization of the algorithm is the solution that enables
existing approaches to work feasibly on big data; it can be
carried out using different methodologies such as the Message
Passing Interface, MapReduce, and many others. MapReduce
is a prominent parallel data processing framework that has
been gaining significant interest from both academia and
industry.
The MapReduce framework enables users to develop large-
scale distributed applications efficiently by supporting load
balancing, fault tolerance, and data locality.
PROPOSED SYSTEM
We propose a scalable design and implementation of
glowworm swarm optimization clustering using the
MapReduce methodology, called glowworm swarm
optimization for data clustering with MapReduce. It differs
from the plain glowworm swarm optimization clustering
algorithm in that it implements the Map and Reduce functions
in order to solve big data clustering problems and to enhance
scalability.
The Modules in this Project are Shown Below
Glowworm Initialization: The unlabeled data is taken for
clustering. In this technique, a swarm of agents is initially
distributed at random in a search space. The agents are
considered glowworms, each carrying a quantity of a
luminescent substance called luciferin.
The glowworms emit light whose intensity is proportional
to their luciferin and interact with other agents available
within a variable neighborhood. Each glowworm identifies its
neighbors and computes its movements by exploiting an
adaptive neighborhood, which is bounded above by its sensor
range.
Centroid Selection: Swarm agents are selected as centroids
for the reduce tasks; here the map and reduce functions are
executed.
Clustering Movement: In this module, the various agents
are clustered and moved.
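The luciferin and movement mechanics described in these modules can be sketched following the standard glowworm swarm optimization formulation: luciferin decays and gains a fitness-proportional amount, and each agent steps towards a probabilistically chosen brighter neighbor within its sensing radius. The parameter values and names below are illustrative, not taken from this implementation.

```python
import math
import random

def luciferin_update(ell, fitness, rho=0.4, gamma=0.6):
    """Decay the previous luciferin level and add a fitness-proportional gain."""
    return (1 - rho) * ell + gamma * fitness

def move_step(i, positions, luciferin, radius=2.0, step=0.03, rng=random):
    """Move glowworm i one fixed-size step towards a brighter neighbor,
    chosen with probability proportional to the luciferin difference;
    stay put if no brighter neighbor lies within the sensing radius."""
    p = positions[i]
    brighter = [j for j, q in enumerate(positions)
                if j != i and math.dist(p, q) <= radius
                and luciferin[j] > luciferin[i]]
    if not brighter:
        return p
    weights = [luciferin[j] - luciferin[i] for j in brighter]
    j = rng.choices(brighter, weights=weights)[0]
    d = math.dist(p, positions[j])
    return tuple(x + step * (y - x) / d for x, y in zip(p, positions[j]))

positions = [(0.0, 0.0), (1.0, 0.0)]
luciferin = [1.0, 5.0]
new_pos = move_step(0, positions, luciferin)
# the only brighter neighbor is at (1, 0), so glowworm 0 moves to (0.03, 0.0)
```

In the proposed system this movement would run inside the map and reduce tasks rather than in a single process.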
EXPERIMENTS AND RESULTS
The MapReduce programming framework is a highly
scalable and robust technique that runs across many compute
nodes in parallel, and it is most applicable for data-intensive
applications when multi-processor machines with large shared
memory are not available. The MapReduce framework on
Hadoop uses two main functions: Map and Reduce. Both take
their inputs and produce their outputs in the form of <key,
value> pairs. The Map function iterates over a large number
of records and extracts interesting information from each
record; all values with the same key are then provided as input
to the same Reduce function. The Reduce function aggregates
the intermediate results generated by the Map function that
share the same key and then generates the final results.
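The <key, value> flow described above can be sketched on a single machine as follows; the word-count map and reduce functions are the classic illustrative example, not part of the proposed system.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Shuffle: group every intermediate value by its key, then hand each
    (key, values) group to one reduce call, mimicking the framework."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word count: map emits <word, 1>, reduce sums the values per key.
counts = run_mapreduce(
    ["map reduce", "map"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {"map": 2, "reduce": 1}
```

The same pattern carries the glowworm centroid search: the map task emits candidate centroids as keys, and the reduce task confirms each centroid from the values grouped under it.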
This section presents the algorithm used in the
implementation to achieve the desired outcome.
Table 1: Glowworm Algorithm
Input: Data set
Output: Clustered data
Steps
1) Load the data set
2) Initialize the swarm to random positions
3) Select centroids based on sub-coverage distance
4) Perform map and reduce tasks for each iteration
5) Find the fitness value
6) Cluster the data based on sub-coverage distance and fitness
value
The following snapshots (given after the conclusion) show
the outputs produced by the step-by-step execution of all the
modules of the system. The graph in Figure 4 plots time versus
data size for the DBSCAN and glowworm clustering
algorithms.
CONCLUSION
In this work, an efficient and highly scalable data clustering
algorithm is introduced. The algorithm is designed and
implemented on Hadoop using the MapReduce programming
model and is optimized for clustering large unstructured data
sets. Because the processing time required to analyze large
unstructured data sets is significantly high, the algorithm is
implemented on Hadoop with the MapReduce programming
framework to overcome this inefficiency. The optimized
glowworm data clustering algorithm illustrates that glowworm
swarm optimization for data clustering can efficiently be
parallelized with the MapReduce programming model to
process very large data sets. The experimental results show
that the proposed algorithm yields high accuracy and
scalability when used with very large data sets.
Snapshots
REFERENCES AND NOTES
1. Impetus Technologies white paper, “Planning Hadoop/NoSQL
Projects for 2011,” Available: http://www.techrepublic.com/
whitepapers/Planninghadoopnosql-projects-for-2011/2923717,
March 2011.
2. Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau,
David E. Culler, Joseph M. Hellerstein, and David A.
Patterson. High-performance sorting on networks of
workstations. In Proceedings of the 1997 ACM SIGMOD
International Conference on Management of Data, Tucson,
Arizona, May 1997.
3. Jeffrey Dean and Sanjay Ghemawat, “MapReduce:
Simplified Data Processing on Large Clusters”, Google, Inc.
4. Zibin Zheng, Jieming Zhu, and Michael R. Lyu , “Service-
generated Big Data and Big Data-as-a-Service: An
Overview” , 978-0-7695-5006-0/13 2013 IEEE.
5. McKinsey Global Institute, 2011, Big Data: The next
frontier for innovation, competition, and productivity,
Available:www.mckinsey.com/~/media/McKinsey/dotcom/
Insights%20and%20pubs/MGI/Research/Technology%20a
nd%20Innovation/Big%20Data/MGI_big_data_full_report.
ashx, Aug, 2012.
6. Aditya B. Patel, Manashvi Birla, Ushma Nair, “Addressing
Big Data Problem Using Hadoop and Map Reduce”,
NUiCONE-2012, 06- 08DECEMBER, 2012.
7. Apache Software Foundation. Official apache hadoop
website, http://hadoop.apache.org/, Aug, 2012.
8. Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and
D. Stott Parker, “Map-Reduce-Merge: Simplified Relational
Data Processing on Large Clusters,” in Proc. of ACM
SIGMOD, pp. 1029–1040, 2007.
9. H. Mi, H. Wang, Y. Zhou, M. R. Lyu, and H. Cai, “Towards
fine-grained, unsupervised, scalable performance diagnosis
for production cloud computing systems,” IEEE Transactions
on Parallel and Distributed Systems, 2013.
10. M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer,
“Pinpoint: problem determination in large, dynamic
internet services,” in Proceedings of the International
Conference on Dependable Systems and Networks
(DSN’02), pp. 595–604.
11. B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson,
M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper,
a large-scale distributed systems tracing infrastructure,”
Google, Inc., Tech. Rep., 2010.
12. S. Lohr, “The age of big data,” New York Times, vol. 11,
2012.
13. “Challenges and opportunities with big data,” leading
Researchers across the United States, Tech. Rep., 2011.
14. E. Slack, “Storage infrastructures for big data workflows,”
Storage Switchland, LLC, Tech. Rep., 2012.
15. “Big data-as-a-service: A market and technology
perspective,” EMC Solution Group, Tech. Rep., 2012.
16. J. Horey, E. Begoli, R. Gunasekaran, S.-H. Lim, and J.
Nutaro, “Big data platforms as a service: challenges and
approach,” in Proceedings of the 4th USENIX conference
on Hot Topics in Cloud Ccomputing, ser. HotCloud’12,
2012, pp. 16–16.
17. “Why big data analytics as a service?”
http://www.analyticsasaservice.org/why-big-data-
analytics-as-a-service/, August 2012.
18. P. O’Brien, “The future: Big data apps or web services?”
http://blog.fliptop.com/blog/2012/05/12/the-future-big-
data-apps-or-web-services/, 2013.
19. Apache Cassandra, http://cassandra.apache.org.
20. N. Lynch and S. Gilbert, “Brewer’s conjecture and the
feasibility of consistent, available, partition-tolerant web
services,” SIGACT, 2002.
21. Revolution Analytics,
https://github.com/RevolutionAnalytics/RHadoop/wiki.