Conventional clustering algorithms mine static databases and generate a set of patterns in the form of clusters. Many real-life databases, however, grow incrementally, and for such dynamic databases the patterns extracted from the original database become obsolete. Conventional clustering algorithms are therefore unsuitable for incremental databases, since they cannot modify the clustering results in accordance with recent updates. In this paper, the author proposes a new incremental clustering algorithm called CFICA (Cluster Feature-based Incremental Clustering Approach) for numerical data, and suggests a new proximity metric called the Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA uses the proposed proximity metric to determine the membership of a data point in a cluster.
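The paper defines the exact form of IPE; purely as an illustration of the idea, a proximity score that penalizes a point by both its distance to the cluster representative and its distance to the farthest point in its vicinity could look like the following sketch (the combination rule here is an assumption, not CFICA's formula):

    import numpy as np

    def inverse_proximity_estimate(point, centroid, cluster_points):
        # Illustrative stand-in for IPE: blend the distance to the cluster
        # representative with the distance to the farthest cluster member.
        d_centroid = np.linalg.norm(point - centroid)
        d_farthest = max(np.linalg.norm(point - q) for q in cluster_points)
        return d_centroid * (1.0 + d_centroid / d_farthest)  # assumed blend

A point would then be assigned to the cluster for which this estimate is smallest.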
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE (IJDKP)
Metadata is information about the data stored in a data warehouse, and it is a mandatory element in building an efficient one. Metadata helps in data integration, lineage, data quality, and populating transformed data into the warehouse. Spatial data warehouses are based on spatial data, mostly collected from Geographical Information Systems (GIS) and from transactional systems specific to an application or enterprise. Metadata design and deployment is the most critical phase in building a data warehouse, where it is mandatory to bring spatial information and data modeling together. In this paper, we present a holistic metadata framework that drives metadata creation for a spatial data warehouse. Theoretically, the proposed metadata framework improves the efficiency of data access in response to frequent queries on SDWs; in other words, it decreases query response time and ensures that accurate information, including spatial information, is fetched from the data warehouse.
With the rapid development of Geographic Information Systems (GISs) and their applications, more and more geographical databases have been developed by different vendors. However, data integration and access remain major problems for GIS application development, as no interoperability exists among different spatial databases. In this paper we propose a unified approach for spatial data querying. The paper describes a framework for integrating information from repositories containing different vector data formats and repositories containing raster datasets. The presented approach converts the different vector data formats into a single unified format (File Geodatabase, "GDB"). In addition, we employ metadata to support a wide range of user queries retrieving relevant geographic information from heterogeneous, distributed repositories, which enhances both query processing and performance.
As databases have developed, the volume of stored data has increased rapidly, and much important information lies hidden in these large amounts of data. If that information can be extracted, it can create substantial value for an organization. The question organizations ask is how to extract this value, and the answer is data mining. Many technologies are available to data mining practitioners, including artificial neural networks, genetic algorithms, fuzzy logic, and decision trees. Many practitioners are wary of neural networks due to their black-box nature, even though they have proven themselves in many situations. This paper is an overview of artificial neural networks and questions their position as a preferred tool of data mining practitioners.
Enhancement techniques for data warehouse staging area (IJDKP)
Poor performance can turn a successful data warehousing project into a failure. Consequently, various researchers have attempted to deal with the problem of scheduling the Extract-Transform-Load (ETL) process. In this paper we present several approaches for enhancing the data warehousing Extract, Transform, and Load stages. We focus on enhancing the performance of the extract and transform phases by proposing two algorithms that reduce the time needed in each phase by employing the semantic information hidden in the data; using this information, a large volume of useless data can be pruned at an early design stage. We also address the problem of scheduling the execution of ETL activities, with the goal of minimizing ETL execution time, by examining three scheduling techniques for ETL. Finally, we experimentally show their behavior in terms of execution time in the sales domain, to understand the impact of implementing each one and to choose the one leading to the greatest performance enhancement.
Clustering the results of a search helps the user get an overview of the information returned. In this paper, we treat the clustering task as cataloguing the search results; by catalogue we mean a structured label list that helps the user make sense of the labels and search results. Cluster labelling is crucial, because meaningless or confusing labels may mislead users into checking the wrong clusters for their query and wasting time. Additionally, labels should accurately reflect the contents of the documents within the cluster. To label clusters effectively, a new cluster labelling method is introduced, with emphasis on producing comprehensible and accurate cluster labels in addition to discovering the document clusters. We also present a new metric for assessing the success of cluster labelling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to two prominent search result clustering methods, Suffix Tree Clustering and Lingo, performing the experiments on the publicly available datasets Ambient and ODP-239.
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately from little training data. Semi-supervised classification uses both labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We use support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy increase of 4% on average, and up to 20%, in comparison with the traditional semi-supervised model.
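scikit-learn does not ship a transductive SVM, but a self-training wrapper around an SVM gives a feel for the semi-supervised SVM setup being enhanced here (a sketch of the general technique, not the authors' ontology-based algorithm; the data below is synthetic stand-in for tf-idf document vectors):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.semi_supervised import SelfTrainingClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))                 # hypothetical document vectors
    # scikit-learn convention: unlabeled samples carry the label -1
    y = np.where(np.arange(100) < 20, (X[:, 0] > 0).astype(int), -1)

    clf = SelfTrainingClassifier(SVC(kernel="linear", probability=True))
    clf.fit(X, y)                                  # trains on labeled + unlabeled
    predictions = clf.predict(X)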
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in a virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced, based on probabilistic scores derived from both the data sources and the clustered duplicates.
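As a toy illustration of this kind of fusion (the actual probabilistic scores are defined in the paper; the source-trust weights below are an assumption):

    from collections import defaultdict

    def fuse_attribute(candidates, source_trust):
        # candidates: [(value, source), ...] from one cluster of duplicates.
        # Each value votes with the trust score of the source that supplied
        # it; the highest-scoring value represents the real-world object.
        score = defaultdict(float)
        for value, source in candidates:
            score[value] += source_trust.get(source, 0.5)  # assumed default trust
        return max(score, key=score.get)

    # fuse_attribute([("NYC", "a"), ("New York", "b"), ("NYC", "c")],
    #                {"a": 0.9, "b": 0.6, "c": 0.8})  ->  "NYC"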
Recommendation system using Bloom filter in MapReduce (IJDKP)
Many clients like to use the Web to discover product details in the form of online reviews provided by other clients and by specialists. Recommender systems offer an important response to the information overload problem by presenting users with more practical and personalized information. Collaborative filtering methods are a vital component of recommender systems, as they generate high-quality recommendations by leveraging the preferences of a community of similar users. Collaborative filtering assumes that people with the same tastes choose the same items. Conventional collaborative filtering systems have drawbacks such as the sparse-data problem and a lack of scalability, so a new recommender system is required that deals with sparse data and produces high-quality recommendations in a large-scale mobile environment. MapReduce is a programming model widely used for large-scale data analysis. The recommendation mechanism described here for mobile commerce is user-based collaborative filtering implemented in MapReduce, which reduces the scalability problem of conventional CF systems. One of the essential operations in this analysis is the join. MapReduce, however, is not very efficient at executing joins, because it always processes all records in the datasets even when only a small fraction of them is relevant to the join. This problem can be reduced by applying the BloomJoin algorithm: Bloom filters are constructed and used to filter out redundant intermediate records. The proposed Bloom-filter-based algorithm reduces the number of intermediate results and improves join performance.
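To make the filtering step concrete, here is a minimal Bloom filter sketch (illustrative only; in a BloomJoin the filter is built over the join keys of the smaller dataset and shipped to the mappers scanning the larger one):

    import hashlib

    class BloomFilter:
        def __init__(self, m=1 << 20, k=5):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, key):
            # derive k bit positions from salted hashes of the key
            for i in range(self.k):
                h = hashlib.md5(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # false positives possible, false negatives impossible
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(key))

Mappers drop any record whose join key fails might_contain(), shrinking the intermediate results shuffled to the reducers.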
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets. Analysing frequent itemsets is a crucial step in analysing structured data and in finding association relationships between items, and it stands as an elementary foundation for supervised learning, which encompasses classifier and feature extraction methods. Applying this algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines. Setting up such an infrastructure is expensive, so a distributed environment such as a cluster is employed for tackling these scenarios. The Apache Hadoop distribution is one of the cluster frameworks for distributed environments; it helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on a map/reduce design and implementation of the Apriori algorithm for structured data analysis.
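A hedged sketch of what one pass of such a map/reduce Apriori design can look like, counting candidate k-itemsets per input split and summing the partial counts in the reducer (the paper's actual job layout may differ):

    from itertools import combinations
    from collections import Counter

    def map_phase(transactions, k, candidates):
        # one mapper: emit partial counts for candidates seen in its split
        counts = Counter()
        for t in transactions:
            for itemset in combinations(sorted(t), k):
                if itemset in candidates:
                    counts[itemset] += 1
        return counts

    def reduce_phase(partial_counts, min_support):
        # reducer: sum partial counts, keep itemsets meeting minimum support
        total = Counter()
        for c in partial_counts:
            total.update(c)
        return {s: n for s, n in total.items() if n >= min_support}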
One of the most important problems in modern finance is finding efficient ways to summarize and visualize stock market data, so as to give individuals and institutions useful information about market behavior for investment decisions. Investment can be considered one of the fundamental pillars of the national economy, and at present many investors look for criteria to compare stocks and select the best ones, choosing strategies that maximize the earning value of the investment process. The enormous amount of valuable data generated by the stock market has therefore attracted researchers to explore this problem domain with different methodologies, and research in data mining has gained attention due to the importance of its applications and the ever-increasing generation of information. Data mining tools such as association rules, rule induction methods, and the Apriori algorithm are used to find associations between different stock market scripts, and much research and development has addressed the reasons for fluctuations in the Indian stock exchange. Nowadays, two factors, gold prices and US dollar prices, strongly influence the Indian stock market; statistical correlation is used to find the correlation between gold prices, dollar prices, and the BSE index, which helps the activities of stock operators, brokers, investors, and jobbers. These activities are based on forecasting the fluctuation of index share prices, gold prices, dollar prices, and customer transactions. Hence the researcher has taken these problems as a topic for research.
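The statistical tool in question is the ordinary Pearson correlation coefficient; for example, with three aligned daily series (the CSV layout here is hypothetical):

    import numpy as np

    # columns: BSE index close, gold price, USD/INR rate
    bse, gold, usd = np.loadtxt("market.csv", delimiter=",", unpack=True)

    print("corr(BSE, gold):", np.corrcoef(bse, gold)[0, 1])
    print("corr(BSE, USD): ", np.corrcoef(bse, usd)[0, 1])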
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the outcomes returned by search engines in reply to user queries. In this paper, we provide a comprehensive survey of document clustering.
Horizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets (IJMER)
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
COMBINED MINING APPROACH TO GENERATE PATTERNS FOR COMPLEX DATA (cscpconf)
Data mining applications often involve complex data such as multiple heterogeneous data sources, user preferences, decision-making actions, and business impacts. In such settings, a single data mining method cannot obtain the complete useful information in the form of informative patterns, because joining large relevant data sources to discover patterns covering all aspects of the useful information, even when possible, consumes too much time and space. We consider combined mining as an approach for mining informative patterns from multiple data sources, over multiple features, or by multiple methods, as the requirements dictate. In the multi-source combined mining approach, we apply the Lossy Counting algorithm to each data source to obtain the frequent itemsets, and then derive the combined association rules. In the multi-feature combined mining approach, we obtain pair patterns and cluster patterns, and then generate incremental pair patterns and incremental cluster patterns, which cannot be generated directly by existing methods. In the multi-method combined mining approach, we combine FP-growth and a Bayesian belief network to build a classifier that yields more informative knowledge.
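For reference, the Lossy Counting algorithm used in the first approach can be sketched compactly (the standard algorithm, in a simplified single-machine version):

    import math

    def lossy_counting(stream, epsilon):
        # Every item with true frequency >= epsilon * N survives, and its
        # count is underestimated by at most epsilon * N.
        width = math.ceil(1 / epsilon)          # bucket width
        counts, deltas, bucket, n = {}, {}, 1, 0
        for item in stream:
            n += 1
            if item in counts:
                counts[item] += 1
            else:
                counts[item], deltas[item] = 1, bucket - 1
            if n % width == 0:                  # bucket boundary: prune
                for it in [i for i in counts if counts[i] + deltas[i] <= bucket]:
                    del counts[it], deltas[it]
                bucket += 1
        return counts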
Frequent Itemset Mining of Big Data for Social Media (IJERA)
Big data is a term for massive data sets with a large, varied, and complex structure that are difficult to store, analyze, and visualize for further processing or results. Big data includes data from email, documents, pictures, audio and video files, and other sources that do not fit into a relational database, and this unstructured data brings enormous challenges. The process of examining massive amounts of data to reveal hidden patterns and secret correlations is called big data analytics, so big data implementations need to be analyzed and executed as accurately as possible. The proposed model structures the unstructured data from social media into a structured form so that the data can be queried efficiently using the Hadoop MapReduce framework. Big data mining is essential for extracting value from massive amounts of data, and MapReduce is a more efficient method for dealing with big data than traditional techniques. The proposed combination of the Knuth-Morris-Pratt linguistic string matching algorithm and the K-Means clustering algorithm provides a suitable platform for extracting value from massive amounts of data and producing recommendations for the user: Knuth-Morris-Pratt matching gives proper matching output for a user query, while K-Means clusters the data using a vector space model, making it an appropriate method for producing recommendations.
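As an illustration of the linguistic matching component, here is the classic Knuth-Morris-Pratt algorithm itself (the standard algorithm, shown independently of the proposed model's Hadoop pipeline):

    def kmp_search(text, pattern):
        # Return all start indices of pattern in text in O(len(text) + len(pattern)).
        # Failure function: longest proper prefix of pattern that is also a suffix.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        hits, k = [], 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):               # full match ends at index i
                hits.append(i - k + 1)
                k = fail[k - 1]
        return hits

    # kmp_search("abracadabra", "abra") -> [0, 7]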
DECISION TREE CLUSTERING: A COLUMN-STORES TUPLE RECONSTRUCTION (cscpconf)
Column-stores have gained market share as a promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pay performance penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. The proposed approach exploits a decision tree algorithm to cluster attributes for each projection and also eliminates frequent database scanning. Experiments with TPC-H data show the effectiveness of the proposed approach.
CONTENT BASED VIDEO CATEGORIZATION USING RELATIONAL CLUSTERING WITH LOCAL SCA... (ijcsit)
This paper introduces a novel approach to efficient video categorization. It relies on two main components. The first is a new relational clustering technique that identifies video key frames by learning cluster-dependent Gaussian kernels. The proposed algorithm, called the clustering and Local Scale Learning algorithm (LSL), learns the underlying cluster-dependent dissimilarity measure while finding compact clusters in the given dataset. The learned measure is a Gaussian dissimilarity function defined with respect to each cluster. We minimize a single objective function to obtain both the optimal partition and the cluster-dependent parameters; this optimization is performed iteratively by dynamically updating the partition and the local measure. The kernel learning task exploits the unlabeled data and, reciprocally, the categorization task takes advantage of the locally learned kernel. The second component of the proposed video categorization system consists of discovering the video categories in an unsupervised manner using the proposed LSL. We illustrate the clustering performance of LSL on synthetic 2D datasets and on high-dimensional real data, and we assess the proposed video categorization system using a real video collection.
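The cluster-dependent measure has the flavour of the following sketch (the precise parametrisation and update rules are given in the paper; this only shows the Gaussian dissimilarity shape):

    import numpy as np

    def gaussian_dissimilarity(x, y, sigma_c):
        # Dissimilarity under the local scale sigma_c learned for cluster c:
        # near zero within the cluster's scale, approaching 1 far away.
        return 1.0 - np.exp(-np.sum((x - y) ** 2) / sigma_c ** 2)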
Clustering high-dimensional data, which arises in almost all fields these days, is becoming a very tedious process. The key disadvantage of high-dimensional data is the curse of dimensionality: as the magnitude of a dataset grows, the data points become sparse and the density of any region decreases, making it difficult to cluster the data and reducing the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data. In particular, methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information; we propose a kernel approach for semi-supervised clustering and present two special cases of this kernel approach in detail.
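For concreteness, the heart of kernel k-means is that distances to the (implicit) cluster centroids can be computed from the kernel matrix alone; a sketch of one assignment step (illustrative, not the paper's full global method):

    import numpy as np

    def kernel_kmeans_assign(K, labels, n_clusters):
        # ||phi(x_i) - m_c||^2 = K_ii - 2*mean_{j in c} K_ij + mean_{j,l in c} K_jl
        n = K.shape[0]
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        return dist.argmin(axis=1)

Pairwise constraints enter by adding penalty terms to these distances (or by augmenting K), which is the kind of connection to the HMRF objective exploited above.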
Data mining is used to manage the huge amounts of information stored in data warehouses and databases, and to discover the required information and knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the field has been a focus of attention for many years. A well-known technique among the available data mining strategies is clustering of the dataset; it is among the most effective data mining methods. It groups the dataset into a number of clusters based on certain predefined guidelines and can reliably discover connections between the different characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevancy for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed for clustering the data points. In this work, the author enhances the Euclidean distance formula to increase cluster quality.
The problem of accuracy and of redundant, dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach is proposed: it uses a similarity function to check the similarity level of a point before including it in a cluster.
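A minimal sketch of that idea, assuming a simple distance-derived similarity function (the paper's exact enhanced distance formula and similarity test are not reproduced here):

    import numpy as np

    def assign_with_similarity_gate(points, centroids, threshold):
        # Nearest-centroid assignment by Euclidean distance, but a point is
        # admitted to a cluster only if its similarity passes the threshold;
        # otherwise it is left unassigned (-1).
        labels = []
        for p in points:
            d = np.linalg.norm(centroids - p, axis=1)
            c = int(d.argmin())
            similarity = 1.0 / (1.0 + d[c])     # assumed similarity function
            labels.append(c if similarity >= threshold else -1)
        return np.array(labels)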
A PSO-Based Subtractive Data Clustering Algorithm (IJORCS)
There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web, and fast, high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize that information. Recent studies have shown that partitional clustering algorithms such as k-means are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data; these estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive + PSO (Particle Swarm Optimization) clustering algorithm that performs fast clustering. For comparison, we applied the Subtractive + PSO clustering algorithm, PSO, and subtractive clustering to three different datasets. The results illustrate that the Subtractive + PSO clustering algorithm generates the most compact clustering results compared with the other algorithms.
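Subtractive clustering itself is compact enough to sketch (the standard formulation with conventional radii defaults; the PSO refinement stage of the hybrid is omitted):

    import numpy as np

    def subtractive_clustering(X, ra=0.5, rb=0.75, n_centers=3):
        # Each point's potential is a sum of Gaussian contributions from all
        # points; repeatedly pick the highest-potential point as a center and
        # subtract its neighbourhood's contribution from the rest.
        alpha, beta = 4 / ra ** 2, 4 / rb ** 2
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        potential = np.exp(-alpha * d2).sum(axis=1)
        centers = []
        for _ in range(n_centers):
            c = int(potential.argmax())
            centers.append(X[c])
            potential -= potential[c] * np.exp(-beta * d2[:, c])
        return np.array(centers)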
A H-K clustering algorithm for high dimensional data using ensemble learning (ijitcs)
Advances to traditional clustering algorithms solve various problems such as the curse of dimensionality and the sparsity of data with many attributes. The traditional H-K clustering algorithm can resolve the randomness and a-priori choice of the initial centers in k-means clustering, but when applied to high-dimensional data it suffers from the dimensional-disaster problem due to high computational complexity. Advanced clustering algorithms such as subspace and ensemble clustering improve the performance of clustering high-dimensional datasets from different aspects and to different extents, yet each of these algorithms improves performance from only a single perspective. The objective of the proposed model is to improve the performance of traditional H-K clustering and overcome its limitations, namely high computational complexity and poor accuracy on high-dimensional data, by combining three different clustering approaches: subspace clustering, ensemble clustering, and H-K clustering.
An Iterative Improved k-means Clustering (IDES)
Clustering is an unsupervised machine learning technique in data mining used to place data elements into related groups without advance knowledge of the group definitions. One of the most popular and widely studied clustering methods that minimizes the clustering error for points in Euclidean space is k-means clustering. However, the k-means method converges to one of many local minima, and it is known that the final results depend on the initial starting points (means). In this research paper, we introduce and test an improved algorithm that starts k-means with good starting points. Good initial starting points allow k-means to converge to a better local minimum and also decrease the number of iterations over the full dataset. Experimental results show that good initial starting points lead to a good solution while reducing the number of iterations needed to form the clusters.
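One widely used way to obtain such good starting means is k-means++ seeding, sketched below (this is the standard k-means++ technique, not necessarily the exact procedure of the paper):

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        # First center uniformly at random; each further center is chosen with
        # probability proportional to its squared distance to the nearest
        # center picked so far, spreading the initial means apart.
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)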
Similar to "New proximity estimate for incremental update of non-uniformly distributed clusters":
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
- Create a campaign using Mailchimp with merge tags/fields
- Send an interactive Slack channel message (using buttons)
- Have the message received by managers and peers, along with a test email for review
But there's more. In a second workflow supporting the same use case, you'll see:
- Your campaign sent to target colleagues for approval
- If the "Approve" button is clicked, a Jira/Zendesk ticket created for the marketing design team
- If the "Reject" button is pushed, colleagues alerted via a Slack message
Join us to learn more about this new human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud or on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT Security Threat Landscape Report 2024 also covers:
- State of global ICS asset and network exposure
- Sectoral targets and attacks, as well as the cost of ransom
- Global APT activity, AI usage, actor and tactic profiles, and implications
- Rise in volumes of AI-powered cyberattacks
- Major cyber events in 2024
- Malware and malicious payload trends
- Cyberattack types and targets
- Vulnerability exploit attempts on CVEs
- Attacks on countries: USA
- Expansion of bot farms: how, where, and why
- In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
- Why are attacks on smart factories rising?
- Cyber risk predictions
- Axis of attacks: Europe
- Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The keynote covers the key trends across hardware, cloud and open source, exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at every stage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
New proximity estimate for incremental update of non uniformly distributed clusters
International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol.3, No.5, September 2013
DOI : 10.5121/ijdkp.2013.3509
NEW PROXIMITY ESTIMATE FOR
INCREMENTAL UPDATE OF NON-UNIFORMLY
DISTRIBUTED CLUSTERS
A.M.Sowjanya and M.Shashi
Department of Computer Science and Systems Engineering,
College of Engineering, Andhra University, Visakhapatnam, India
ABSTRACT
The conventional clustering algorithms mine static databases and generate a set of patterns in the form of
clusters. Many real life databases keep growing incrementally. For such dynamic databases, the patterns
extracted from the original database become obsolete. Thus the conventional clustering algorithms are not
suitable for incremental databases due to lack of capability to modify the clustering results in accordance
with recent updates. In this paper, the author proposes a new incremental clustering algorithm called
CFICA(Cluster Feature-Based Incremental Clustering Approach for numerical data) to handle numerical
data and suggests a new proximity metric called Inverse Proximity Estimate (IPE) which considers the
proximity of a data point to a cluster representative as well as its proximity to a farthest point in its vicinity.
CFICA makes use of the proposed proximity metric to determine the membership of a data point into a
cluster.
KEYWORDS
Data mining, Clustering, Incremental Clustering, K-means, CFICA, Non-uniformly distributed clusters, Inverse Proximity Estimate, Cluster Feature.
1. INTRODUCTION
Clustering discovers patterns from a wide variety of domain data, and thus many clustering algorithms have been developed by researchers. The main problem with the conventional clustering algorithms is that they mine static databases and generate a set of patterns in the form of clusters. Numerous applications maintain their data in large databases or data warehouses, and many real life databases keep growing incrementally. New data may be added periodically, either on a daily or weekly basis. For such dynamic databases, the patterns extracted from the original database become obsolete. Conventional clustering algorithms handle this problem by repeating the process of clustering on the entire database whenever a significant set of data items is added. Let DS be the original database (static database) and ΔDS be the incremental database. Conventional clustering algorithms process the expanded database (DS + ΔDS) to form a new cluster solution from scratch. The process of re-running the clustering algorithm on the entire dataset is inefficient and time-consuming. Thus most of the conventional clustering algorithms are not suitable for incremental databases due to their lack of capability to modify the clustering results in accordance with recent updates.
In this paper, the authors propose a new incremental clustering algorithm called CFICA (Cluster Feature-Based Incremental Clustering Approach for numerical data) to handle numerical data. It is an incremental approach to partitional clustering. CFICA uses the concept of Cluster Feature (CF) for abstracting out the details of data points maintained on the hard disk. At the same time, the Cluster Feature provides all essential information required for incremental update of a cluster.
Most of the conventional clustering algorithms make use of the Euclidean distance (ED) between the cluster representatives (mean / mode / medoid) and the data point to estimate the acceptability of the data point into the cluster.
In the context of incremental clustering, while adapting the existing patterns or clusters to the enhanced data upon the arrival of a significant chunk of data points, it is often required to elongate the existing cluster boundaries in order to accept new data points if there is no loss of cluster cohesion. The authors have observed that the Euclidean distance (ED) between the single point cluster representative and the data point will not suffice for deciding the membership of the data point into the cluster, except for uniformly distributed clusters. Instead, the set of farthest points of a cluster can represent the data spread within a cluster and hence has to be considered for the formation of natural clusters. The authors suggest a new proximity metric called Inverse Proximity Estimate (IPE), which considers the proximity of a data point to a cluster representative as well as its proximity to the farthest point in its vicinity. CFICA makes use of the proposed proximity metric to determine the membership of a data point into a cluster.
2. RELATED WORK
Incremental clustering has attracted the attention of the research community with Hartigan’s
Leader clustering algorithm [1] which uses a threshold to determine if an instance can be placed
in an existing cluster or it should form a new cluster by itself. COBWEB [2] is an unsupervised
conceptual clustering algorithm that produces a hierarchy of classes. Its incremental nature allows
clustering of new data to be made without having to repeat the clustering already made. It has
been successfully used in engineering applications [3]. CLASSIT [4] is an alternative version of
COBWEB. It handles continuous or real valued data and organizes them into a hierarchy of
concepts. It assumes that the attribute values of the data records belonging to a cluster are
normally distributed. As a result, its application is limited. Another such algorithm was developed
by Fazil Can to cluster documents [5]. Charikar et al. defined the incremental clustering problem
and proposed an incremental clustering model which preserves all the desirable properties of HAC (hierarchical agglomerative clustering) while providing an extension to the dynamic case [6]. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is especially suitable for large numbers of data items [7]. Incremental DBSCAN was presented by Ester et al.; it is suitable for mining in a data warehousing environment where the databases have frequent updates [8]. The GRIN algorithm [9] is an incremental hierarchical clustering algorithm for numerical data sets based on gravity theory in physics. Serban and Campan have presented an incremental algorithm known as Core Based Incremental Clustering (CBIC), based on the k-means clustering method, which is capable of re-partitioning the object set when the attribute set changes [10]. In the incremental versions of Facility Location and k-median, new demand points that arrive one at a time are assigned either to an existing cluster or a newly created one by the algorithm, in order to maintain a good solution [11].
3. FUNCTIONALITY OF CFICA
An incremental clustering algorithm has to perform the primary tasks of initial cluster formation and summarization, acceptance of new data items into either existing clusters or new clusters, and merging of clusters to maintain compactness and cohesion. CFICA also takes care of concept drift and appropriately refreshes the cluster solution upon significant deviation from the original concept.
It may be observed that once the initial cluster formation is done and the summaries are represented as Cluster Features, all the basic tasks of the incremental clustering algorithm CFICA can be performed without reading the actual data points (probably maintained on hard disk) constituting the clusters. The data points need to be read again only when the cluster solution has to be refreshed due to concept drift.
3.1 Initial clustering of the static database
The proposed algorithm CFICA is capable of clustering incremental databases starting from scratch. However, during the initial stages, refreshing the cluster solution happens very often as the size of the initial clusters is very small. Hence, for efficiency reasons, the authors suggest applying a partitional clustering algorithm to form clusters on the initial collection of data points (DS). The authors used the k-means clustering algorithm for initial clustering to obtain k clusters, as it is the simplest and most commonly used partitional clustering algorithm. Also, k-means is relatively scalable and efficient in processing large datasets because its computational complexity is O(nkt), where n is the total number of objects, k is the number of clusters and t represents the number of iterations. Normally, k << n and t << n, and hence O(n) is taken as its time complexity [12].
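As an illustration of this initial step, the following is a minimal Python sketch, assuming scikit-learn's KMeans is an acceptable stand-in for the initial partitional clustering; the paper does not prescribe a particular implementation.

import numpy as np
from sklearn.cluster import KMeans

def initial_clustering(DS: np.ndarray, k: int):
    # Cluster the static database DS into k partitions with k-means.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(DS)
    return km.labels_, km.cluster_centers_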
3.2 Computation of Cluster Feature (CF)
CFICA uses cluster features for accommodating the essential information required for
incremental maintenance of clusters. The basic concept of cluster feature has been adopted from
BIRCH as it supports incremental and dynamic clustering of incoming objects. As CFICA
handles partitional clusters as against hierarchical clusters handled by BIRCH, the original
structure of cluster feature went through appropriate modifications to make it suitable for
partitional clustering.
The Cluster Feature (CF) is computed for every cluster C_i obtained from the k-means algorithm. In CFICA, the Cluster Feature is denoted as

CF_i = { n_i, m_i, m_i′, Q_i, ss_i }

where n_i → number of data points,
m_i → mean vector of the cluster C_i with respect to which the farthest points are calculated,
m_i′ → new mean vector of the cluster C_i that changes due to incremental updates,
Q_i → list of p-farthest points of cluster C_i,
ss_i → squared sum vector that changes during incremental updates.
A Cluster Feature is aimed to provide all essential information of a cluster in the most concise manner. The first two components, n_i and m_i, are essential to represent the cluster prototype in a dynamic environment. n_i is incremented whenever a new data point is added to the cluster. m_i′, the new mean, is essential to keep track of the dynamically changing nature / concept drift occurring in the cluster while it is growing. It is updated upon inclusion of a new data point.

Q_i, the set of p-farthest points of cluster C_i from its existing mean m_i, is used to handle non-uniformly distributed and hence irregularly shaped clusters. The set of p-farthest points of the i-th cluster is calculated as follows. First, Euclidean distances are calculated between the data points within cluster C_i and the mean of the corresponding cluster, m_i. Then, the data points are arranged in descending order of the measured Euclidean distances. Subsequently, the top p points for every cluster are chosen from the sorted list; these points are known as the p-farthest points of the cluster C_i with respect to the mean value m_i. Thus a list of p-farthest points is maintained in the Cluster Feature for every cluster C_i. These p-farthest points are subsequently used for identifying the farthest point q_i in the vicinity of an incoming point. ss_i is the squared sum, which is essential for estimating the quality of a cluster in terms of the variance of data points from its mean.
In general, the variance of a cluster (σ²) containing N data points is defined as

σ² = (1/N) Σ_{i=1..N} (x_i − x̄)² = ( (1/N) Σ_{i=1..N} x_i² ) − x̄²

In the present context, the error associated with the i-th cluster represented by its Cluster Feature CF_i is calculated as

σ_i² = ( (1/n_i) · ss_i ) − m_i²
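To make the Cluster Feature concrete, here is a minimal Python sketch of CF_i and the variance formula above. The field names mirror the paper's notation; the use of NumPy and a dataclass is an assumption of this sketch, not of the paper.

from dataclasses import dataclass
import numpy as np

@dataclass
class ClusterFeature:
    n: int             # n_i : number of data points
    m: np.ndarray      # m_i : snapshot mean used for the p-farthest points
    m_new: np.ndarray  # m_i': running mean updated on every insertion
    Q: list            # Q_i : list of p-farthest points
    ss: np.ndarray     # ss_i: component-wise squared sum vector

def make_cf(points: np.ndarray, p: int) -> ClusterFeature:
    m = points.mean(axis=0)
    # p-farthest points: members with the largest Euclidean distance from m
    order = np.argsort(-np.linalg.norm(points - m, axis=1))
    return ClusterFeature(len(points), m, m.copy(),
                          [points[i] for i in order[:p]],
                          (points ** 2).sum(axis=0))

def variance(cf: ClusterFeature) -> float:
    # sigma_i^2 = (1/n_i) * ss_i - m_i^2, summed over the components
    return float(((cf.ss / cf.n) - cf.m_new ** 2).sum())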
3.3 Insertion of a new data point
Data points in a cluster can either be uniformly or non-uniformly distributed. The shape of a
uniformly distributed cluster is nearly globular and its centroid is located in the middle
(geometrical middle). Non-uniformly distributed clusters have their centroid located in the midst
of dense area, especially if there is a clear variation in the density of data points among dense and
sparse areas of the cluster. The shape of such clusters is not spherical and the farthest points of a
non-uniformly distributed cluster are generally located in the sparse areas.
Fig.1 Uniformly Distributed cluster; Fig.2 Non-Uniformly Distributed cluster (figures omitted)

Fig.3 (figure omitted): the legend distinguishes the entities in a cluster, the cluster representative (mean / mode / medoid), the p-farthest points of a cluster, the farthest point in the vicinity of an entity, and the new entity to be incorporated.
In Figure 3 above, let point 1 be located on the sparser side and point 2 on the denser side of the cluster, at nearly equal distances from the centroid. Point C is the farthest point in the vicinity of point 1, and point A is the farthest point in the vicinity of point 2. Even if point 1 is at a slightly larger distance from the centroid than point 2, it is natural to include point 1 into the cluster C_i in preference to point 2, due to point 1's closeness to existing members (though farthest points) of the cluster, as well as the discontinuity between point 2 and the existing boundary of the cluster on its side.
The results produced by standard partitional clustering algorithms like k-means are not in concurrence with this natural expectation, as they rely upon the Euclidean distance metric (ED) for discriminating data points while determining their membership into a cluster. So, the authors propose a new proximity metric called Inverse Proximity Estimate (IPE) to determine the membership of a data point into a cluster, considering the data distribution within the cluster through a bias factor in addition to the normal Euclidean distance, as shown below:

IPE_i(Δy) = ED(m_i, Δy) + B

where IPE_i(Δy) is the proposed distance metric, ED(m_i, Δy) is the Euclidean distance between the centroid m_i of the cluster C_i and the incoming data point Δy, and B is the bias factor.

Bias is the increment added to the conventional distance metric in view of the formation of more natural clusters and better detection of outliers. It considers the unevenness / shape of the cluster, reflected through the set of p-farthest points, to estimate the proximity of new data points to the cluster.
3.3.1 Bias
Fig.4
In Figure 4, A, B and C are the set of farthest points, and it can be observed that they are located in the sparser areas of the cluster while its centroid is in the midst of the dense area. Point C is the farthest point in the vicinity of point 1, and point A is the farthest point in the vicinity of point 2. But the distance between C and point 1 is much smaller than the distance between A and point 2, which is taken as one of the factors that assess the bias: the bias increases with the distance of the new point to its farthest point. So, bias is proportional to the Euclidean distance ED between the farthest point q_i and the incoming new data point Δy:

B ∝ ED(q_i, Δy)

Another aspect to be considered is that as the distance between the centroid and the particular farthest point in the vicinity of the new point increases, elongation of the cluster with respect to that farthest point should be discouraged. This is depicted in Figure 5 below.

Fig.5 (figure omitted)

In Figure 5, point 1 is in the vicinity of C and point 3 is in the vicinity of A, while both A and C are farthest points. In this context, point 3 is more acceptable than point 1 into the cluster, as C is farther than A from the centroid. The bias increases with the distance of a particular farthest point from its centroid. Therefore, bias is proportional to the Euclidean distance ED between the centroid m_i and the farthest point q_i in the vicinity of the incoming data point Δy:

B ∝ ED(m_i, q_i)

Hence, bias is estimated mathematically from the above equations as the product of ED(q_i, Δy) and ED(m_i, q_i):

B = ED(q_i, Δy) × ED(m_i, q_i)
3.3.2 Proposed Proximity Metric – Inverse Proximity Estimate (IPE)

The authors have devised a new proximity estimate IPE_i(Δy) to determine the acceptability of an incoming data point Δy into a possibly non-uniformly distributed cluster represented by its mean m_i, calculated as follows:

IPE_i(Δy) = ED(m_i, Δy) + B

Substituting B from the above, IPE_i(Δy) is estimated as

IPE_i(Δy) = ED(m_i, Δy) + ED(q_i, Δy) × ED(m_i, q_i)

For clusters with uniformly distributed points, the usual distance measures like Euclidean distance hold good for deciding the membership of a data point into a cluster. But there exist applications where clusters have a non-uniform distribution of data points. In such cases, the new distance metric proposed above will be useful. The inverse proximity estimate IPE_i(Δy) better recognizes the discontinuities in data space while extending the cluster boundaries, as explained below.
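As a sketch, the IPE of section 3.3.2 can be computed directly from a ClusterFeature as defined above; the selection of q_i anticipates section 3.5, and the helper names are this sketch's own, not the paper's.

import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def ipe(cf: "ClusterFeature", y: np.ndarray) -> float:
    # IPE_i(dy) = ED(m_i, dy) + ED(q_i, dy) * ED(m_i, q_i)
    q = min(cf.Q, key=lambda fp: euclidean(fp, y))  # q_i: nearest of the p-farthest points
    return euclidean(cf.m, y) + euclidean(q, y) * euclidean(cf.m, q)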
3.3.3 Extendibility of the cluster boundary towards a sparse area vs. a dense area

Extendibility of a cluster to include a new point towards a sparse area is based on the set of farthest points defining its boundary around the sparse area.

The points that can be included in the region nearer to the dense area, by extending the boundary of the cluster towards the dense area, are much nearer to the centroid of this cluster, based on both the original distance metric ED and the inverse proximity estimate IPE_i(Δy). In other words, such data points lying outside the present cluster boundary are closer to the centroid than the existing farthest members.

However, an external point which is located towards the dense area, at an equal distance from the centroid as the farthest points, is clearly separated from the dense area with a discontinuity in between. The inverse proximity estimate IPE_i(Δy) does not allow extension of the boundary of the cluster on the denser side if there is a discontinuity between the dense area and the new point, whereas such a discontinuity may be overlooked by the original distance metric ED.

Let point 5 and point 6 be equidistant from the centroid of the cluster. As shown in Figure 6, point 5 is towards its sparse area and point 6 is towards its dense area.

Fig.6 (figure omitted)

Though point 6 is at an equal distance from the centroid as point 5, point 6 is not acceptable into the cluster because of the discontinuity.

The only cluster members in the vicinity of point 6, with a possible discontinuity between itself and the cluster boundary, are located in the dense area. Hence, none of them are considered as farthest points of the cluster. Since the sparse area containing the farthest points is away from point 6, the Euclidean distance between the farthest point and the incoming new data point, ED(q_i, Δy), is high for point 6 compared to point 5, increasing the value of the bias for point 6. This in turn increases the inverse proximity estimate IPE_i(Δy) for point 6 compared to point 5, thereby reducing the acceptability of point 6 into the cluster.
3.4 Incremental Clustering Approach with CFICA

CFICA uses the inverse proximity estimate IPE_i(Δy) for effectively identifying the appropriate cluster of an incoming data point Δy. In other words, it estimates the proximity of the incoming point to a cluster based on the cluster centroid (mean m_i), the farthest point in the vicinity of the incoming point (q_i), and the incoming data point (Δy).

For each cluster C_i, the Euclidean distance ED is calculated for the following pairs of points: centroid and incoming point (m_i, Δy); farthest point in the vicinity of the incoming point and incoming point (q_i, Δy); and centroid and farthest point in the vicinity of the incoming point (m_i, q_i).

Upon the arrival of a new data point Δy to the existing database DS, which is already clustered into C = {C1, C2, ..., Ck}, its distance IPE_i(Δy) to the i-th cluster, for all i = 1 to k, is calculated using the equation specified in section 3.3.2:

IPE_i(Δy) = ED(m_i, Δy) + ED(q_i, Δy) × ED(m_i, q_i)

where ED(m_i, Δy) is the Euclidean distance between the points m_i and Δy; ED(q_i, Δy) is the Euclidean distance between the points q_i and Δy; and ED(m_i, q_i) is the Euclidean distance between the points m_i and q_i.
3.5 Finding the farthest point in the vicinity of the incoming data point Δy

In order to find the farthest point q_i (a data point in Q_i) which is in the vicinity of the incoming data point Δy, the Euclidean distance is calculated from the data point Δy to each of the p-farthest points of that cluster, and the point with minimum distance is designated as q_i. Thereby, for each cluster, the farthest point having minimum distance to Δy is taken as q_i.
3.6 Finding the appropriate cluster

The proximity metric IPE is used to find the appropriate cluster for the new data point Δy. The new data point Δy is assigned to the closest cluster only if the calculated proximity IPE_i(Δy) is less than the predefined threshold value λ. Otherwise, the data point Δy is not included in any of the existing clusters but separately forms a new singleton cluster. In such a case, the number of clusters is incremented by one.
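Sections 3.4 to 3.6 together amount to an arg-min over IPE followed by a threshold test; a minimal sketch, reusing the ipe helper above and treating the return value -1 as "open a new singleton cluster":

def find_cluster(cfs: list, y, lam: float) -> int:
    # lam is the paper's radius threshold, lambda
    scores = [ipe(cf, y) for cf in cfs]
    j = min(range(len(cfs)), key=scores.__getitem__)
    return j if scores[j] < lam else -1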
3.7 Updating of the Cluster Feature

From the above sections, it can be seen that whenever a new data point Δy is added to the existing database, the new data point may be included into one of the existing clusters or it may form a new cluster. So after the new point gets inserted, updating of the CF is important for further processing.

Case 1: Inclusion of the new data point into one of the existing clusters

Whenever a new data point Δy is included into an already existing cluster C_i, its cluster feature (CF_i) is updated without requiring the original data points of C_i, which is what supports incremental update of the clustering solution.

In particular, the m_i′, n_i and ss_i fields of CF_i are updated upon the arrival of a new data point Δy into the cluster C_i as given below:

(1) m_i′ = (n_i · m_i′ + Δy) / (n_i + 1)
(2) n_i = n_i + 1
(3) ss_i = ss_i + squared components of Δy

However, Q_i, representing the p-farthest points, and the centroid of the previous snapshot, m_i, are kept without any changes until the next periodic refresh.
Case 2: The new data point forms a singleton cluster

If the new data point forms a new cluster separately, the cluster feature (CF_i) has to be computed for the new cluster containing the data point Δy. The CF for the new cluster contains the following information:

(1) Number of data points: here n = 1, as it is a singleton cluster.
(2) Mean of the new cluster: in this case, m_i = m_i′ = Δy.
(3) The list of p-farthest points includes Δy only.
(4) The squared sum is the squared sum of the components of Δy.

Finding the appropriate cluster to incorporate the new data point Δy, and updating of the cluster feature after adding it to an existing cluster or forming a separate cluster, are iteratively performed for all the data points in the incremental database ΔDS.
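A minimal sketch of both update cases, assuming the ClusterFeature sketch above; NumPy handles the component-wise vector arithmetic:

def insert_point(cf: "ClusterFeature", y: np.ndarray) -> None:
    # Case 1: fold a new point into an existing cluster's CF
    cf.m_new = (cf.n * cf.m_new + y) / (cf.n + 1)  # running mean update
    cf.n += 1
    cf.ss = cf.ss + y ** 2
    # Q_i and the snapshot mean m_i stay fixed until the next refresh

def singleton_cf(y: np.ndarray) -> "ClusterFeature":
    # Case 2: the new point opens its own cluster
    return ClusterFeature(1, y.copy(), y.copy(), [y.copy()], y ** 2)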
3.8 Merging of the closest cluster pair

Once the incremental database ΔDS is processed with CFICA, the need to merge the closest clusters may arise. A merging strategy is used to maintain a reasonable number of clusters with high quality. Therefore, merging is performed when the number of clusters increases beyond k (the k in k-means), while ensuring that the increase in variance, which indicates error, due to merging is minimal. It is intuitive to expect an increase in the error with a decrease in the number of clusters. A closest cluster pair is considered for merging only if the Euclidean distance between the centroids of the pair of clusters is smaller than the user defined merging threshold (θ).

The procedure used for the merging process is described below:

Step 1: Calculate the Euclidean distance ED between every pair of cluster centroids (m_i).
Step 2: For every cluster pair with Euclidean distance ED less than the merging threshold value θ (ED < θ), find the increase in variance (σ²) as described in section 3.2.
Step 3: Identify the cluster pair (C_i, C_j) with minimum increase in variance.
Step 4: Merge C_i and C_j to form the new cluster C_k, compute the Cluster Feature for C_k, and delete C_i and C_j along with their CFs.
Step 5: Repeat steps 1 to 4 until no cluster pair is mergeable or until the value of k is adjusted.

Computing the Cluster Feature of the merged cluster

After merging the closest cluster pair, the Cluster Feature (CF) has to be computed for the merged cluster. The CF of the merged cluster is calculated by:

(1) Adding the numbers of data points of both clusters
(2) Calculating the mean of the merged cluster
(3) Finding the p-farthest points of the merged cluster
(4) Adding the squared sums.

In particular, when the i-th cluster with CF_i = { n_i, m_i, m_i′, Q_i, ss_i } and the j-th cluster with CF_j = { n_j, m_j, m_j′, Q_j, ss_j } are merged to form the k-th cluster, its cluster feature CF_k is calculated as given below:

1) n_k = n_i + n_j
2) m_k = (n_i · m_i + n_j · m_j) / n_k
3) m_k′ = m_k
4) The p-farthest points are selected from the farthest points in Q_i and Q_j.
5) ss_k = ss_i + ss_j

Hence the Cluster Feature of the new cluster k, formed by merging the two existing clusters i and j, is determined as a function of CF_i and CF_j:

CF_k = f(CF_i, CF_j)

Thus the Cluster Feature provides the essence of the clustered data points, thereby avoiding explicit referencing of the individual objects of the clusters, which may be maintained in external memory space.
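As a sketch of how CF_k = f(CF_i, CF_j) might look in code, reusing the ClusterFeature sketch above; taking the merged p-farthest points as the p members of Q_i ∪ Q_j farthest from the merged mean is this sketch's reading of step 4:

def merge_cf(a: "ClusterFeature", b: "ClusterFeature", p: int) -> "ClusterFeature":
    n = a.n + b.n
    m = (a.n * a.m + b.n * b.m) / n  # weighted mean of the two clusters
    pool = sorted(a.Q + b.Q, key=lambda q: -euclidean(q, m))
    return ClusterFeature(n, m, m.copy(), pool[:p], a.ss + b.ss)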
3.9 Need for Cluster Refresh

The addition of new data points into some of the existing clusters naturally results in a change of mean. So, the set of p-farthest points of cluster C_i need not be the p-farthest points of cluster C_i′, the modified version of C_i. Whether or not the p-farthest points of the cluster remain the same depends upon the recently added data points. Ideally, a new set of p-farthest points has to be computed for the incremented cluster C_i′. Recalculating the p-farthest points every time the cluster gets updated is not an easy task, as it involves recalculation of the Euclidean distance ED for every data point in the incremented cluster C_i′ with the updated mean m_i′.
For pragmatic reasons, it was suggested to refresh CF_i only in case the mean deviates significantly from its original value, indicating concept drift [Chen H.L et al. 2009]. After processing a new chunk of data points from ΔDS, the deviation in mean is calculated as follows:

Deviation in mean = |m_i′ − m_i| / |m_i|

Those clusters with deviation in mean greater than the threshold δ are identified as clusters with significant concept drift and hence need to be refreshed. The process of cluster refresh involves finding a new set of p-farthest points for such clusters. For the remaining clusters, the same set of p-farthest points along with the old mean value m_i is maintained. It may be noted that m_i in CF_i always represents the centroid based on which the p-farthest points were identified; hence it needs to be changed whenever a new set of p-farthest points is identified. The above process of cluster refresh is applicable only to clusters C_i′ incremented due to the inclusion of additional data points.
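A minimal sketch of this refresh policy, assuming the deviation is taken as a ratio of vector norms and that read_points is a hypothetical callback that re-reads a cluster's members from disk only when a refresh is actually triggered:

def needs_refresh(cf: "ClusterFeature", delta: float) -> bool:
    dev = np.linalg.norm(cf.m_new - cf.m) / np.linalg.norm(cf.m)
    return dev > delta

def refresh(cf: "ClusterFeature", points: np.ndarray, p: int) -> None:
    cf.m = cf.m_new.copy()  # the new snapshot centroid
    order = np.argsort(-np.linalg.norm(points - cf.m, axis=1))
    cf.Q = [points[i] for i in order[:p]]  # recompute the p-farthest points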
4. EXPERIMENTAL ANALYSIS

CFICA has been implemented using the Iris [13], Wine [14] and Yeast [15] datasets from the UCI machine learning repository. All the datasets were preprocessed. The Iris and Wine datasets do not have any missing values, but the Yeast dataset has some missing attribute values and such instances were ignored. After preprocessing, the Yeast dataset had 1419 instances. Normally, datasets used for the purpose of analysis may contain too many attributes, which may or may not be relevant. Therefore, dimensionality reduction has been done on the Wine and Yeast datasets as they contain a considerable number of attributes. The number of attributes came down from 13 to 3 for the Wine dataset and from 8 to 5 for the Yeast dataset.

Fig 1. Purity of Wine dataset before and after dimensionality reduction (chart omitted)
Fig 2. Purity of Yeast dataset before and after dimensionality reduction (chart omitted)
It can be seen from the above figures that the cluster purity increases after dimensionality
reduction has been done.
4.1 Cluster formation

The above datasets are processed dynamically by dividing the data points in each dataset into chunks. For example, the Iris dataset is divided into 4 chunks (Iris1 consisting of 75 instances; Iris2, Iris3 and Iris4 consisting of 25 instances each). Initially, the first chunk of data points, Iris1, is given as input to the k-means algorithm for initial clustering. It generates k clusters. Then, the cluster feature is computed for those k initial clusters. The next chunk of data points is input to the new approach incrementally. For each data point from the second chunk, the Inverse Proximity Estimate (IPE) is computed, and the data point is assigned to the corresponding cluster if the calculated distance measure is less than the predefined threshold value λ; otherwise, it forms a separate cluster. Subsequently, the cluster feature is updated for each data point. Once the whole chunk of data points is processed, the merging process is done only if the Euclidean distance between the centroids of a pair of clusters is smaller than the user defined merging threshold θ. Here, λ = 10 and θ = 4. Finally, the set of resultant clusters is obtained from the merging process, and the cluster solution is updated incrementally upon receiving the later chunks of data, Iris3 and Iris4. Similarly, the Wine and Yeast datasets are also divided into 4 chunks as follows: wine1 - 100 instances, wine2 and wine3 - 25 instances each, wine4 - 28 instances; yeast1 - 700 instances, yeast2 - 350 instances, yeast3 - 200 instances, yeast4 - 168 instances; and handled as above.

4.2 Metrics in which performance is estimated

Validation of clustering results is important. Therefore, the Purity Measure has been used to evaluate the clustering results obtained. A cluster is called a pure cluster if all the objects belong to a single class. The purity measure described in [16] [17] has been used for evaluating the performance of CFICA. The evaluation metric used in CFICA is given below:

Purity = (1/N) Σ_{i=1..T} X_i

where N is the number of data points in the dataset, T is the number of resultant clusters, and X_i is the number of data points of the majority class in cluster i.
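For reference, a minimal sketch of the purity computation, assuming ground-truth class labels are available alongside the cluster assignments:

from collections import Counter

def purity(cluster_labels: list, true_labels: list) -> float:
    N = len(true_labels)
    by_cluster: dict = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, []).append(t)
    # X_i: size of the majority class within cluster i
    return sum(Counter(members).most_common(1)[0][1]
               for members in by_cluster.values()) / N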
The purity of the resultant clusters is calculated by changing the k-value (order of initial
clustering) and is graphically presented in Figures 3 to 5.
Fig 3. Purity vs. number of clusters (k) for Iris dataset (chart omitted).
Fig 4. Purity vs. number of clusters (k) for Wine dataset (chart omitted).
Fig 5. Purity vs. number of clusters (k) for Yeast dataset (chart omitted).
The above results demonstrate that CFICA performs better than BIRCH in terms of purity. The BIRCH algorithm is unable to deliver satisfactory clustering quality when the clusters are not spherical in shape, because it employs the notion of radius or diameter to control the boundary of a cluster.
5. THE PSEUDO-CODE FOR CFICA
The various steps constituting the proposed algorithm CFICA are listed below in the form of pseudo-code. CFICA accepts the clustering solution for the existing database DS, in the form of cluster features, as input and incrementally updates the set of cluster features in accordance with the newly arrived chunk of data points in ΔDS.
Input
  S → Initial set of CFs for DS
  ΔDS → Set of data points added to DS
  k → Number of clusters
  θ → Predefined merging threshold value
  δ → Allowed deviation in mean
  λ → User defined radius threshold

Output
  Set of CFs { CF_1, CF_2, ..., CF_k } for (DS + ΔDS)

Variables
  Δy → New data point
  CF_i → Cluster Feature of the i-th cluster
  IPE_i(Δy) → Inverse Proximity Estimate between Δy and the i-th cluster
  ED → Euclidean distance
  n_i → Number of data points
  m_i → Mean vector of the cluster
  m_i′ → New mean vector of the cluster
  Q_i → List of p-farthest points of the i-th cluster
  q_i → The data point in Q_i which is closest to Δy
  ss_i → Squared sum vector
Algorithm

1) Compute the Cluster Feature for every cluster C_i:
     CF_i = { n_i, m_i, m_i′, Q_i, ss_i }
   a = k
2) For each data point Δy ∈ ΔDS:
   a) For i from 1 to a, calculate the Inverse Proximity Estimate of Δy to the i-th cluster:
        IPE_i(Δy) = ED(m_i, Δy) + ED(q_i, Δy) × ED(m_i, q_i)
   b) Find the most suitable cluster j = arg min_i { IPE_i(Δy) }
      if ( IPE_j(Δy) < λ )
      {
        insert Δy into its closest cluster j and update its CF as follows:
          CF_j = { (n_j + 1), m_j, (n_j · m_j′ + Δy) / (n_j + 1), Q_j, (ss_j + squared components of Δy) }
        if ( deviation in mean = |m_j′ − m_j| / |m_j| > δ )
          read the data points of the j-th cluster to recompute CF_j
      }
      else create a new cluster:
      {
        increment a
        insert Δy into the a-th cluster
        CF_a = { 1, Δy, Δy, {Δy}, squared components of Δy }
      }
3) If ( a > k ), merge cluster pairs:
   (i) Compute ED between every pair of cluster centroids based on m_i.
   (ii) For cluster pairs with ( ED < θ ), find the increase in variance.
   (iii) Merge the cluster pair with the minimum increase in variance and decrement a.
   (iv) Recalculate the CF of the merged cluster:
          CF_k = { (n_i + n_j), (n_i · m_i + n_j · m_j) / n_k, m_k′ = m_k, Q_k, ss_k = ss_i + ss_j }
        Find its ED to the other clusters.
   (v) Repeat steps (i) to (iv) until ( a = k ).
4) Print the cluster solution as the set of cluster features. Wait for the arrival of a new chunk of data points, upon which call CFICA again.
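Tying the pseudo-code together, here is a hedged Python sketch of one pass of CFICA over a chunk, reusing the earlier sketches (find_cluster, insert_point, singleton_cf, needs_refresh, refresh, merge_cf, variance, euclidean, and the hypothetical read_points callback); lam, theta and delta stand for λ, θ and δ, and the increase-in-variance criterion is this sketch's reading of steps 3(ii)-(iii):

def process_chunk(cfs, chunk, k, p, lam, theta, delta, read_points):
    # Step 2: place every incoming point
    for y in chunk:
        j = find_cluster(cfs, y, lam)
        if j >= 0:
            insert_point(cfs[j], y)
            if needs_refresh(cfs[j], delta):
                refresh(cfs[j], read_points(j), p)  # re-read members from disk
        else:
            cfs.append(singleton_cf(y))
    # Step 3: merge the closest pairs until we are back to k clusters
    while len(cfs) > k:
        pairs = [(i, j) for i in range(len(cfs)) for j in range(i + 1, len(cfs))
                 if euclidean(cfs[i].m_new, cfs[j].m_new) < theta]
        if not pairs:
            break
        # pick the pair whose merge adds the least variance
        i, j = min(pairs, key=lambda ij: variance(merge_cf(cfs[ij[0]], cfs[ij[1]], p))
                   - variance(cfs[ij[0]]) - variance(cfs[ij[1]]))
        merged = merge_cf(cfs[i], cfs[j], p)
        cfs = [cf for idx, cf in enumerate(cfs) if idx not in (i, j)] + [merged]
    return cfs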
6. CONCLUSIONS
An incremental clustering algorithm called Cluster Feature-Based Incremental Clustering Approach for numerical data (CFICA), which makes use of the Inverse Proximity Estimate to handle entities described in terms of only numerical attributes, was developed and evaluated. The Cluster Feature, while being compact, includes all essential information required for the maintenance and expansion of clusters. Thus CFICA avoids redundant processing, which is the essential feature of an incremental algorithm. The performance of this algorithm, in terms of purity, is compared with the state-of-the-art incremental clustering algorithm BIRCH on different benchmark datasets.
REFERENCES
[1] Hartigan, J.A. 1975. Clustering Algorithms. John Wiley and Sons, Inc., New York, NY.
[2] Fisher D., “Knowledge acquisition via incremental conceptual clustering,” Machine Learning,
vol. 2, 1987, pp.139-172.
[3] Fisher , D., Xu, L., Carnes, R., Rich, Y., Fenves, S.J., Chen, J., Shiavi, R., Biswas, G., and
Weinberg, J. 1993. Applying AI clustering to engineering tasks. IEEE Expert 8, 51–60.
[4] J. Gennary, P. Langley, and D. Fisher, “Models of Incremental Concept Formation,” Artificial
Intelligence Journal, vol. 40, 1989, pp. 11-61.
[5] Fazil Can, “Incremental Clustering for Dynamic Information Processing”, ACM Transactions on
Information Systems, April 1993, Vol. 11, No. 2, pp. 143-164.
[6] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, “Incremental clustering and dynamic
information retrieval,” 29th Symposium on Theory of Computing, 1997, pp. 626—635.
[7] Tian Zhang , Raghu Ramakrishnan , Miron Livny, “BIRCH: an efficient data clustering method
for very large databases”, Proceedings of the ACM SIGMOD International. Conference on
Management of Data, pp 103-114, 1996.
[8] Ester, M., Kriegel, H., Sander, J., Xu X., Wimmer, M., “Incremental Clustering for Mining in a
Data Warehousing Environment”, Proceedings of the 24th International. Conference on Very
Large Databases (VLDB’98), New York, USA, 1998, pp. 323-333.
[9] Chien-Yu Chen, Shien-Ching Hwang, and Yen-Jen Oyang, "An Incremental Hierarchical Data
Clustering Algorithm Based on Gravity Theory", Proceedings of the 6th Pacific- Asia
Conference on Advances in Knowledge Discovery and Data Mining, 2002, pp: 237 – 250.
[10] Serban G., Campan A., "Incremental Clustering Using a Core-Based Approach", Lecture Notes
in Computer Science, Springer Berlin, Vol: 3733, pp: 854-863, 2005.
[11] Fotakis D., "Incremental algorithms for Facility Location and k-Median", Theoretical Computer
Science, Vol: 361, No: 2-3, pp: 275-313, 2006.
[12] Jain A.K., Murthy M.N., Flynn P.J., “Data Clustering: A Review”, ACM Computing Surveys,
September 1999, Vol. 31, No.3, pp. 264 – 323.
[13] Iris dataset : http://archive.ics.uci.edu/ml/datasets/Iris
[14] Wine dataset : http://archive.ics.uci.edu/ml/datasets/Wine
[15] Yeast dataset : http://archive.ics.uci.edu/ml/datasets/Yeast
[16] Huang Z, “Extensions to the k-Means Algorithm for Clustering Large Data Sets With
Categorical Values”, Data Mining and Knowledge Discovery, 1998, Vol. 2, pp. 283–304.
[17] Xiaoke Su, Yang Lan, Renxia Wan, and Yuming Qin, “A Fast Incremental Clustering
Algorithm”, Proceedings of the 2009 International Symposium on Information Processing
(ISIP’09), 2009, pp. 175-178.
[18] Sowjanya A.M. and Shashi M., “Cluster Feature-Based Incremental Clustering Approach
(CFICA) For Numerical Data”, IJCSNS International Journal of Computer Science and Network
Security, Vol.10, No.9, September 2010.
AUTHORS
A.M.Sowjanya received her M.Tech. degree in Computer Science and Technology from Andhra University. She is presently working as an Assistant Professor in the Department of Computer Science and Systems Engineering, College of Engineering (Autonomous), Andhra University, Visakhapatnam, Andhra Pradesh, India. She has submitted her Ph.D. thesis at Andhra University. Her areas of interest include data mining, incremental clustering and database management systems.
M.Shashi received her B.E. degree in Electrical and Electronics Engineering and M.E. degree in Computer Engineering with distinction from Andhra University. She received her Ph.D. in 1994 from Andhra University and won the best Ph.D. thesis award. She is working as a Professor of Computer Science and Systems Engineering at Andhra University, Andhra Pradesh, India. She received the AICTE Career Award as a young teacher in 1996. She is a co-author of the Indian edition of the textbook “Data Structures and Program Design in C” from Pearson Education Ltd. She has published technical papers in national and international journals. Her research interests include data mining, artificial intelligence, pattern recognition and machine learning. She is a life member of ISTE and CSI, and a fellow member of the Institution of Engineers (India).