The document discusses employing a rough set based approach for clustering categorical time-evolving data. It proposes using node importance and a rough membership function to label unlabeled data points and to detect concept drift between clustering results over time. Specifically, it defines key terms such as nodes, introduces the problem of clustering categorical time-evolving data and detecting concept drift, and then describes using a rough membership function to calculate the similarity between unlabeled data and existing clusters, in order to label the data and detect changes in cluster characteristics over time.
Finding Relationships between the Our-NIR Cluster Results (CSCJournals)
Evaluating node importance in clustering has been an active research area, and many methods have been developed. Most clustering algorithms rely on general similarity measures; in real situations, however, data usually changes over time, and clustering such data naively not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results. We previously proposed the Our-NIR method, which improves on the method proposed by Ming-Syan Chen, as demonstrated by its node-importance results; node importance is very useful in clustering categorical data, but the method still has deficiencies in data labeling and outlier detection. In this paper we modify the Our-NIR method for evaluating node importance by introducing a probability distribution, and show through comparison of results that it performs better.
Ensemble based Distributed K-Modes Clustering (IJERD Editor)
Clustering is the unsupervised classification of data items into groups. With the explosion in the number of autonomous data sources, there is a growing need for effective distributed clustering approaches, which cluster distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed on large datasets, but it cannot directly handle datasets with categorical attributes, which commonly occur in real-life data. Huang proposed the K-Modes clustering algorithm, introducing a new dissimilarity measure for categorical data; it replaces cluster means with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms in the literature target numerical data. In this paper, a novel Ensemble-based Distributed K-Modes clustering algorithm is proposed, which handles categorical datasets and performs the distributed clustering process asynchronously. Its performance is compared with existing distributed K-Means clustering algorithms and a K-Modes-based centralized clustering algorithm, with experiments carried out on various datasets from the UCI machine learning repository.
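The frequency-based mode update and matching dissimilarity at the heart of K-Modes can be sketched as follows; this is a minimal illustration of the two operations, not Huang's full algorithm, and the function names are my own:

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Huang's simple matching distance: number of attributes on which
    the two categorical tuples disagree."""
    return sum(a != b for a, b in zip(x, y))

def update_mode(cluster):
    """Replace the 'mean' with a mode: for each attribute, take the most
    frequent category among the cluster's members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = update_mode(cluster)
assert mode == ("red", "small")
assert matching_dissimilarity(("red", "large"), mode) == 1
```

Iterating assignment by `matching_dissimilarity` and re-computation by `update_mode` until the modes stop changing is what minimizes the K-Modes cost function.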
Reduct generation for the incremental data using rough set theory (csandit)
In today’s changing world a huge amount of data is generated and transferred frequently. Although the data is sometimes static, most commonly it is dynamic and transactional, with newly generated data constantly added to the old/existing data. To discover knowledge from this incremental data, one approach is to rerun the algorithm on the modified dataset each time, which is time consuming. The paper proposes a dimension-reduction algorithm that can be applied in a dynamic environment to generate the reduced attribute set as a dynamic reduct. The method analyzes the new dataset when it becomes available, and modifies the reduct accordingly to fit the entire dataset. The concepts of discernibility relation, attribute dependency, and attribute significance from Rough Set Theory are integrated for the generation of the dynamic reduct set, which not only reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed method has been applied to a few benchmark datasets collected from the UCI repository and a dynamic reduct is computed. Experimental results show the efficiency of the proposed method.
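The attribute-dependency measure from Rough Set Theory that drives reduct generation can be sketched on a toy decision table; this is an illustration of the dependency degree itself, not the paper's incremental algorithm, and the table and function names are my own:

```python
from collections import defaultdict

def partition(table, attrs):
    """Group object indices by their values on `attrs` -- the blocks of
    the indiscernibility relation IND(attrs)."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(table, cond, dec):
    """Degree gamma to which decision attributes `dec` depend on condition
    attributes `cond`: the fraction of objects whose cond-block lies wholly
    inside a single dec-block (the positive region)."""
    dec_blocks = [set(b) for b in partition(table, dec)]
    pos = sum(len(b) for b in partition(table, cond)
              if any(set(b) <= d for d in dec_blocks))
    return pos / len(table)

# toy decision table: columns 0 and 1 are conditions, column 2 the decision
table = [("a", "x", 0), ("a", "x", 0), ("a", "y", 1), ("b", "y", 1)]
assert dependency(table, [0, 1], [2]) == 1.0   # all conditions decide fully
assert dependency(table, [1], [2]) == 1.0      # attribute 1 alone is a reduct
assert dependency(table, [0], [2]) == 0.25     # attribute 0 alone is not
```

A reduct is a minimal subset of condition attributes whose dependency equals that of the full set, which is exactly what the second assertion exhibits.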
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
A brief description of clustering, two relevant clustering algorithms (K-Means and Fuzzy C-Means), clustering validation, and two internal validity indices (Dunn and Davies-Bouldin).
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R... (ijscmc)
Face recognition is one of the most unobtrusive biometric techniques and can be used for access control as well as surveillance purposes. Various methods for implementing face recognition have been proposed, with varying degrees of performance in different scenarios. The most common issue with effective facial biometric systems is the high susceptibility of the face to variations owing to factors like changes in pose, varying illumination, differing expression, presence of outliers, noise, etc. This paper explores a novel technique for face recognition by classifying the face images with an unsupervised learning approach, K-Medoids clustering, using the Partitioning Around Medoids (PAM) algorithm. The results suggest increased robustness to noise and outliers in comparison to other clustering methods. The technique can therefore be used to increase the overall robustness of a face recognition system, thereby increasing its invariance and making it a reliably usable biometric modality.
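A naive sketch of PAM's swap-based search is given below; real PAM implementations use dedicated BUILD/SWAP phases with cost caching, so this exhaustive version (with my own initialisation and function names) is illustrative only:

```python
import numpy as np

def assign_cost(D, medoids):
    """Total cost: each point's distance to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, iters=20):
    """Greedy PAM sketch: repeatedly swap a medoid with a non-medoid
    whenever the swap lowers the total assignment cost."""
    n = D.shape[0]
    medoids = list(range(k))              # simplistic initialisation
    for _ in range(iters):
        best = (assign_cost(D, medoids), None)
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                cand = medoids[:i] + [h] + medoids[i + 1:]
                c = assign_cost(D, cand)
                if c < best[0]:
                    best = (c, cand)
        if best[1] is None:               # no improving swap: converged
            break
        medoids = best[1]
    return sorted(medoids)

# two obvious groups on a line: {0, 1, 2} and {10, 11, 12}
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(pts[:, None] - pts[None, :])
assert pam(D, 2) == [1, 4]   # medoids land at the middle of each group
```

Because medoids are actual data points chosen to minimize absolute distances, a single far-off outlier cannot drag a representative away the way it drags a mean, which is the robustness the abstract refers to.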
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ... (IJECEIAES)
A hard partition clustering algorithm assigns each point, even one equally distant from two clusters, to exactly one cluster, although such a datum could plausibly belong to several clusters simultaneously. Fuzzy cluster analysis instead assigns membership coefficients, so data points that are equidistant between two clusters belong to more than one cluster at the same time. For a subset of the CiteScore dataset, fuzzy clustering (fanny) and fuzzy c-means (fcm) algorithms were implemented to study data points that lie equally distant from each other. Before analysis, the clusterability of the dataset was evaluated with the Hopkins statistic, which resulted in 0.4371, a value < 0.5, indicating that the data is highly clusterable. The optimal number of clusters was determined using the NbClust package, where 9 different indices proposed 3 clusters as the best solution. Further, an appropriate value of the fuzziness parameter m was evaluated to determine the distribution of membership values as m varies from 1 to 2. The coefficient of variation (CV), also known as relative variability, was evaluated to study the spread of the data. The time complexity of the fanny and fcm algorithms was evaluated by keeping the number of data points constant and varying the number of clusters.
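The fuzzy c-means membership update that produces these coefficients can be sketched as follows; this is the standard update step under my own function naming, not the fanny/fcm implementations used in the study:

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Fuzzy c-means membership update: u[i, j] is the degree to which
    point i belongs to cluster j; larger fuzzifier m softens memberships."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)                       # guard points sitting on a centre
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

X = np.array([[0.0], [10.0], [5.0]])
centers = np.array([[0.0], [10.0]])
u = fcm_memberships(X, centers)
assert np.allclose(u.sum(axis=1), 1.0)          # memberships sum to 1 per point
assert np.allclose(u[2], [0.5, 0.5])            # the equidistant point splits evenly
assert u[0, 0] > 0.99                           # a point on a centre is decisive
```

The equidistant point receiving 0.5/0.5 membership is exactly the behaviour, impossible under a hard partition, that the abstract studies.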
Introduction to Multi-Objective Clustering Ensemble (IJSRD)
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concepts of data mining, association rules, and multilevel association rules with different algorithms and their advantages, along with the concepts of fuzzy logic and genetic algorithms. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Data mining is used to manage the huge amount of information stored in data warehouses and databases, in order to discover required knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the field has been a point of attention for many years. Among the available data mining strategies, clustering is one of the most effective: it groups the dataset into a number of clusters based on certain predefined rules, and is relied upon to discover the relations between the different attributes of the data.
In the k-means clustering algorithm, features are selected on the basis of their relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed for clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy and redundancy of dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach is proposed that uses a similarity function to check the similarity of a point before including it in a cluster.
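One plausible reading of that similarity check is a thresholded nearest-centroid assignment, sketched below; the function name and threshold are my own assumptions, not the paper's actual formulation:

```python
import numpy as np

def assign_with_threshold(X, centroids, max_dist):
    """Assign each point to its nearest centroid by Euclidean distance,
    but mark it -1 (unassigned) when even the nearest centroid is farther
    than `max_dist`, keeping dissimilar points out of the clusters."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    labels[d.min(axis=1) > max_dist] = -1
    return labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_with_threshold(X, centroids, max_dist=2.0)
assert labels.tolist() == [0, 0, -1]   # the far point is held back as dissimilar
```

Filtering points this way before the centroid update keeps outliers from inflating a cluster and dragging its centroid, which is the redundancy/accuracy issue the abstract describes.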
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS (ijdkp)
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional data. Many significant subspace clustering algorithms exist, each having different characteristics caused by the use of different techniques, assumptions, heuristics, etc. A comprehensive classification scheme is essential which considers all such characteristics to divide subspace clustering approaches into various families, where the algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers to better understand the quality criteria to be used and the similar algorithms against which to compare their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family). Characteristics of a SCAF are based on classes such as cluster orientation, overlap of dimensions, etc. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "Axis parallel, overlapping, density based" SCAF.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering & Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
Among the many data clustering approaches available today, mixed datasets of numeric and categorical data pose a significant challenge due to the difficulty of choosing and employing appropriate distance/similarity functions for clustering and its verification. Unsupervised learning models for artificial neural networks offer an alternate means for data clustering and analysis. The objective of this study is to highlight an approach, and its associated considerations, for mixed dataset clustering with the Adaptive Resonance Theory 2 (ART-2) artificial neural network model, and subsequent validation of the clusters with dimensionality reduction using an Autoencoder neural network model.
Published a paper entitled ‘Visualization of Crisp and Rough Clustering using MATLAB’ in CIIT International Journal of Data Mining and Knowledge Engineering on 12th December 2012, Vol. 4, ISSN 0974-9683.
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are calculated on performance measures such as the number of iterations, the number of points misclassified, accuracy, the Silhouette validity index, and execution time.
Principle Component Analysis Based on Optimal Centroid Selection Model for Su... (ijtsrd)
Clustering large, sparse, large-scale data is an open research problem in data mining. Discovering significant information through standard clustering algorithms proves inadequate, as most of the resulting data turns out to be non-actionable, and existing clustering techniques are not feasible for time-varying data in high-dimensional space. Subspace clustering addresses these problems through the incorporation of domain knowledge and parameter-sensitive prediction, with the sensitivity of the data predicted through a thresholding mechanism. Usability and usefulness in 3D subspace clustering are important issues, and determining the correct dimension is an inconsistent and challenging problem in subspace clustering. In this work we propose a Centroid-based Subspace Forecasting Framework guided by constraints, i.e. must-link and must-not-link, together with domain knowledge. In the unsupervised subspace clustering algorithm, inconsistent constraints correlating to dimensions are resolved through singular value decomposition. Principal component analysis is used to estimate the strength of actionability of particular attributes, utilizing domain knowledge to refine and validate the optimal centroids dynamically. Experimental results show that the proposed framework outperforms competing subspace clustering techniques in terms of efficiency, F-measure, parameter insensitivity, and accuracy. The solution can also benefit police departments and law-enforcement organisations in better understanding stock issues, tracking activities, and predicting likelihoods.
G. Raj Kamal | A. Deepika | D. Pavithra | J. Mohammed Nadeem | V. Prasath Kumar, "Principle Component Analysis Based on Optimal Centroid Selection Model for SubSpace Clustering Model", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-4, June 2020. URL: https://www.ijtsrd.com/papers/ijtsrd31374.pdf ; paper URL: https://www.ijtsrd.com/computer-science/data-miining/31374/principle-component-analysis-based-on-optimal-centroid-selection-model-for-subspace-clustering-model/g-raj-kamal
G0354451
IOSR Journal of Computer Engineering (IOSRJCE)
ISSN: 2278-0661, Volume 3, Issue 5 (July-Aug. 2012), PP 44-51
www.iosrjournals.org

A Study in Employing Rough Set Based Approach for Clustering on Categorical Time-Evolving Data

H. Venkateswara Reddy1, S. Viswanadha Raju2
1(Department of CSE, Vardhaman College of Engineering, India)
2(Department of CSE, JNTUH College of Engineering, India)
Abstract: The proportionate increase in the size of data implies that clustering a very large data set becomes difficult and time consuming. Sampling is one important technique to scale down the size of a dataset and to improve the efficiency of clustering. After sampling, allocating unlabeled data points to proper clusters is difficult in the categorical domain, and in real situations data changes over time; clustering this type of data naively not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results. Both of the cases mentioned above, allocating unlabeled data points to proper clusters after sampling, and finding clustering results when the data changes over time, are difficult in the categorical domain. In this paper, using a node importance technique, a rough set based method is proposed to label unlabeled data points and to find the next clustering result based on the previous clustering result.
Keywords: Categorical Data, Data Labeling, Node Importance, Rough Membership Function.
I. INTRODUCTION
Extracting knowledge from large amounts of data is difficult; this task is known as data mining. Clustering refers to finding collections of similar objects from a given data set, such that objects in different collections are dissimilar. Most clustering algorithms are developed for numerical data and may be easy to use in normal conditions, but not with categorical data [1, 3]. Clustering is a challenging issue in the categorical domain, where the distance between data points is undefined [4]; it is also not easy to find the class label of an unknown data point in this domain. Sampling techniques accelerate clustering [5, 6], and we consider allocating the data points that are not sampled to proper clusters. Data that depends on time is called time-evolving data [7, 8]. For example, the buying preferences of customers may vary with time, depending on the current day of the week, the availability of alternatives, discounting rates, etc. [9]. Since the data is modified and thus evolves with time, the underlying clusters may also change over time through concept drift [10, 11]. Clustering time-evolving data in the numerical domain [12, 13] has been explored in the previous literature, though not in the categorical domain to the extent desired. It is therefore a challenging problem in the categorical domain to evolve a procedure for arriving at a precise categorization.
As a result, a rough set [14] based method for performing clustering on categorical time-evolving data is proposed in this paper. This method finds out whether there is a drifting concept or not while processing the incoming data. In order to detect the drifting concept, the sliding window technique is adopted [15]. Sliding windows conveniently eliminate the outdated records while sifting through a mound of data. Therefore, employing the sliding window technique, we can test the latest data points in the current window to establish whether the characteristics of clusters are similar to the last clustering result or not. This may be easy insofar as the numerical domain is concerned. However, in the categorical domain, the above procedure is challenging since the numerical characteristics of clusters are difficult to define.
Therefore, for capturing the characteristics of clusters, an effective cluster representative that summarizes the clustering result is required. In this paper, a mechanism called rough membership function based similarity is developed to allocate each unclustered categorical data point to its proper cluster [16]. The allocating process goes by the name of data labeling, and this method is also used to test whether the characteristics of the clusters of the latest data points in the current sliding window are similar to the last clustering result or not [17, 19].
The paper is organized as follows. Section II supplies the relevant background to the study; section III deals with the basic definitions and data labeling for drifting concept detection; section IV concludes the study with recommendations.
II. REVIEW OF RELATED LITERATURE
This section provides an exhaustive discussion of various clustering algorithms on categorical data along with
cluster representatives and data labeling [10, 11, 16, 20]. A cluster representative is used to summarize and characterize the clustering result, which has not been discussed in detail in the categorical domain, unlike in the numerical domain [21, 23]. In the K-modes algorithm [24], the most frequent attribute value in each attribute domain of a cluster represents what is known as a mode for that cluster.
domain of a cluster represents what is known as a mode for that cluster. Finding modes may be simple, but the
method of using only one attribute value in each attribute domain to represent a cluster is questionable.
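As an aside, the mode of a cluster amounts to a frequency count per attribute; the sketch below (illustrative Python with a toy cluster, not data from the paper) shows the computation:

```python
from collections import Counter

# Toy cluster of three categorical data points over two attributes.
cluster = [("a", "m"), ("a", "f"), ("b", "m")]

# The mode takes the most frequent value in each attribute column.
mode = tuple(Counter(column).most_common(1)[0][0]
             for column in zip(*cluster))
# mode == ("a", "m")
```

Note how two quite different clusters could share the same mode, which is why a single value per attribute is a questionable representative.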
The ROCK clustering algorithm [25] is a form of agglomerative hierarchical clustering algorithm. It is based on links between data points instead of distances between data points. The notion of links between data points helps to overcome the problems with distance-based coefficients. The link between point i (pi) and point j (pj), denoted link(pi, pj), is defined as the number of common neighbors between pi and pj. ROCK's hierarchical clustering algorithm accepts as input a set S of n sampled points, drawn randomly from the original data set to be clustered as the representatives of the clusters, and the number of desired clusters k. The procedure begins by computing the number of links between pairs of points. The number of links is then used in the algorithm to cluster the data set. The first step in implementing the algorithm is to create a Boolean matrix with entries 1 and 0 based on the adjacency matrix: an entry is 1 if the two corresponding points are adjacent neighbors and 0 otherwise. As this algorithm focuses only on the adjacency of every data point, some data points may be left out or ignored; hence an algorithm based on the entropy of the data points is considered.
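A minimal sketch of the link computation described above (illustrative Python; the adjacency matrix is hypothetical, not from ROCK's experiments):

```python
# Boolean adjacency matrix for four points: adj[i][m] = 1 if point m is
# a neighbor of point i, 0 otherwise.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

def link(i, j):
    """link(pi, pj): number of common neighbors of points i and j."""
    return sum(adj[i][m] and adj[j][m] for m in range(len(adj)))
```

For instance, points 0 and 1 share the single common neighbor 2, so link(0, 1) = 1.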
In statistical categorical clustering algorithms [26, 28] such as COOLCAT [29] and LIMBO [30], data points are grouped based on statistics. In the algorithm COOLCAT, data points are separated in such a way that the expected entropy of the whole arrangement is minimized. In the algorithm LIMBO, the information bottleneck method is applied to minimize the information loss which results from summarizing data points into clusters. However, these algorithms perform clustering based on minimizing or maximizing a statistical objective function, and the cluster representatives in these algorithms are not clearly defined. Therefore, the summarization and characteristic information of the clustering results cannot be obtained by using these algorithms. A different approach is called for, which is the aim of this paper.
III. DRIFTING CONCEPT DETECTION
This section discusses the notations used, the problem definition, and data labeling based on the rough membership function.
III.1 Basic Notations
The problem of clustering categorical time-evolving data is formulated as follows. A series of categorical data points D is given, where each data point is a vector of q attribute values, i.e., xj = (xj1, xj2, ..., xjq). Let A = (A1, A2, ..., Aq), where Aa is the ath categorical attribute, 1 ≤ a ≤ q, and let N be the window size. Divide the n data points into windows of equal size and call the subset at time t the sliding window St; i.e., the first N data points of D are located in the first subset S1. The objective of our method is to find drifting data points between St and St+1.
Consider the data set D = {x1, x2, ..., x30} of 30 data points shown in Table 1. If the sliding window size is 15, then S1 contains the first 15 data points and S2 contains the next 15 data points. We divide these first 15 data points into three clusters by using any clustering method, as shown in Table 2. The data points that are clustered are called clustered data points or labeled data points, and the remaining ones are called unlabeled data points. Our aim is to label the remaining 15 unlabeled data points, which belong to the next sliding window S2, and also to identify whether concept drift has occurred or not.
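The windowing step above can be sketched in a few lines (illustrative Python; the names are ours, not the paper's):

```python
# Split a stream of data points into consecutive windows of size N.
def sliding_windows(points, N):
    return [points[i:i + N] for i in range(0, len(points), N)]

# 30 data points with window size 15 yield the two subsets S1 and S2.
D = ["x%d" % i for i in range(1, 31)]
S1, S2 = sliding_windows(D, 15)
# S1 holds x1..x15 and S2 holds x16..x30.
```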
We define the following term, which is used throughout this method.
Node: A node Ir is defined as an attribute name together with an attribute value.
Basically, a node is an attribute value, but two or more attribute values of different attributes may be identical when those attribute domains have a non-empty intersection, which is possible in real life. To avoid this ambiguity, we define a node not only by its attribute value but also by its attribute name. For example, the nodes [height=60-69] and [weight=60-69] are different nodes even though the attribute values of the attributes height and weight are the same, i.e., 60-69. Because the attribute names height and weight are different, the nodes are different.
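In code, this amounts to representing a node as an (attribute name, attribute value) pair rather than a bare value; a minimal sketch:

```python
# Two nodes with the same attribute value but different attribute names.
node_height = ("height", "60-69")
node_weight = ("weight", "60-69")

# The values coincide, yet the nodes are distinct.
same_value = node_height[1] == node_weight[1]   # True
distinct_nodes = node_height != node_weight     # True
```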
Table 1: A data set D with 30 data points divided into two sliding windows S1 and S2.
S1
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
A1 a b c a a c c c a b c c c b a
A2 m m f m m f m f f m m f m m f
A3 c b c a c a a c b a c b b c a
S2
x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30
A1 a c b a b c c a a b b c b a c
A2 m m f f f f f m m f m m m f f
A3 c a b c a a b b a c a c b b c
Table 2: Three clusters C1, C2 and C3 after performing a clustering method on S1.
C1: x1 x2 x3 x4 x5        C2: x6 x7 x8 x9 x10       C3: x11 x12 x13 x14 x15
A1: a b c a a             A1: c c c a b             A1: c c c b a
A2: m m f m m             A2: f m f f m             A2: m f m m f
A3: c b c a c             A3: a a c b a             A3: c b b c a
III.2 Data labeling
In this section, data labeling for unlabeled data points is done through the rough membership function, based on the similarity between an existing cluster and the unlabeled data point.
To start with, some basic concepts of rough set theory are reviewed, such as the information system, the indiscernibility relation and the rough membership function. Then, a novel similarity between an unlabeled data point and a cluster is defined by considering the node importance values in a given cluster (i.e., the node frequency in a given cluster and the distribution of nodes over different clusters).
Generally, the set of data points to be clustered is stored in a table, where each row (tuple) represents the facts about a data point. We call this table an information system. When all the attributes are categorical, this information system is termed a categorical information system and is defined as a quadruple IS = (U, A, V, f), where U is the nonempty set of data points (or objects), called the universe; A is the nonempty set of attributes; V is the union of all attribute domains, i.e., V = ∪a∈A Va, where Va is the domain of attribute a and is finite and unordered; and f: U × A → V is a mapping called an information function such that for any x ∈ U and a ∈ A, f(x, a) ∈ Va.
Rough set theory, introduced by Pawlak in 1982, is a kind of symbolic machine learning technique for categorical information systems with uncertain information [31, 32]. Employing the notion of the rough membership function, Jiang, Sui and Cao defined the rough outlier factor for outlier detection in [33].
To understand the present method, we discuss the following definitions of rough set theory.
Definition 1: Let IS = (U, A, V, f) be a categorical information system. For any attribute subset P ⊆ A, a binary relation IND(P), called the indiscernibility relation, is defined as
IND(P) = {(x, y) ∈ U × U | ∀a ∈ P, f(x, a) = f(y, a)}.
It is obvious that IND(P) is an equivalence relation on U and that IND(P) = ∩a∈P IND({a}). Given P ⊆ A, the relation IND(P) induces a partition of U, denoted by U/IND(P) = {[x]_P | x ∈ U}, where [x]_P denotes the equivalence class determined by x with respect to P, i.e., [x]_P = {y ∈ U | (x, y) ∈ IND(P)}.
Definition 2: Let IS = (U, A, V, f) be a categorical information system, P ⊆ A and X ⊆ U. The rough membership function μ_P^X : U → [0, 1] is defined as
μ_P^X(x) = |[x]_P ∩ X| / |[x]_P|.
The rough membership function quantifies the degree of relative overlap between the set X and the equivalence class [x]_P to which x belongs.
In classical set theory, an element either belongs to a set or it does not. The corresponding membership function
is the characteristic function for the set, i.e., the function takes values 1 and 0, respectively. However, the rough
membership function takes values between 0 and 1.
Definition 3: Let IS = (U, A, V, f) be a categorical information system, P ⊆ A, U = S ∪ Q and S ∩ Q = ∅. For any x ∈ Q and X ⊆ S, the rough membership function μ_P^{Q,X} : Q → [0, 1] is defined as
μ_P^{Q,X}(x) = |[x]_P^S ∩ X| / |[x]_P^S| if [x]_P^S ≠ ∅, and 0 otherwise,
where [x]_P^S = {u ∈ S | ∀a ∈ P, f(u, a) = f(x, a)}.
In Definition 3, the domain of the rough membership function is a subset Q of U, not the universe U. Moreover, we only consider the equivalence class of x on the set S with respect to the attribute set P.
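Definitions 1-3 translate directly into code. The sketch below (illustrative Python; the toy information system is ours, not from the paper) computes equivalence classes and the two rough membership functions:

```python
# Equivalence class [x]_P taken over a given object set (Definition 1).
def eq_class(objs, f, P, x):
    return {y for y in objs if all(f[y][a] == f[x][a] for a in P)}

# Definition 2: mu_P^X(x) = |[x]_P & X| / |[x]_P| over the universe U.
def rough_membership(U, f, P, X, x):
    c = eq_class(U, f, P, x)
    return len(c & X) / len(c)

# Definition 3: the equivalence class is taken on S only; 0 if empty.
def rough_membership_qx(S, f, P, X, x):
    c = eq_class(S, f, P, x)
    return len(c & X) / len(c) if c else 0.0

# Toy categorical information system with one attribute 'A'.
U = {1, 2, 3, 4}
f = {1: {"A": "a"}, 2: {"A": "a"}, 3: {"A": "b"}, 4: {"A": "b"}}
mu = rough_membership(U, f, ["A"], {1, 3}, 1)   # [1]_P = {1, 2} -> 0.5
```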
Definition 4: Let IS = (U, A, V, f) be a categorical information system, P ⊆ A, U = S ∪ Q and S ∩ Q = ∅. Suppose that a prior clustering result S = {c1, c2, ..., ck} is given, where ci, 1 ≤ i ≤ k, is the ith cluster. For any x ∈ Q, the similarity between an unlabeled object x and a cluster ci with respect to P is defined as
S_P(x, ci) = Σa∈P ma · (1 + (1/log k) Σj=1..k wa^cj · log wa^cj),
where
wa^cj = |[x]_a^S ∩ cj| / |[x]_a^S|,  ma = |{u | f(u, a) = f(x, a), u ∈ ci}| / |ci|,
with the convention 0 · log 0 = 0, and |ci| is the number of data points in the ith cluster. ma characterizes the importance of the attribute value f(x, a) in the cluster ci with respect to attribute a, and wa^cj considers the distribution of the attribute value f(x, a) among the clusters [34]. Since wa^cj · log wa^cj ≤ 0, the bracketed factor equals one minus the normalized entropy of the value's distribution over the clusters. Hence, S_P(x, ci) considers both the intra-cluster similarity and the inter-cluster similarity.
Example 1: Let S = {c1, c2, c3}, where c1, c2 and c3 are the clusters shown in Table 2, and let P = A = {A1, A2, A3}. According to Definition 4, for the data point x16 = (a, m, c):
S_P(x16, c1) = (3/5)·(1 + (1/log 3)·(3/5 log 3/5 + 1/5 log 1/5 + 1/5 log 1/5))
 + (4/5)·(1 + (1/log 3)·(4/9 log 4/9 + 2/9 log 2/9 + 3/9 log 3/9))
 + (3/5)·(1 + (1/log 3)·(3/6 log 3/6 + 1/6 log 1/6 + 2/6 log 2/6)) = 0.1560
S_P(x16, c2) = (1/5)·(1 + (1/log 3)·(3/5 log 3/5 + 1/5 log 1/5 + 1/5 log 1/5))
 + (2/5)·(1 + (1/log 3)·(4/9 log 4/9 + 2/9 log 2/9 + 3/9 log 3/9))
 + (1/5)·(1 + (1/log 3)·(3/6 log 3/6 + 1/6 log 1/6 + 2/6 log 2/6)) = 0.056
S_P(x16, c3) = (1/5)·(1 + (1/log 3)·(3/5 log 3/5 + 1/5 log 1/5 + 1/5 log 1/5))
 + (3/5)·(1 + (1/log 3)·(4/9 log 4/9 + 2/9 log 2/9 + 3/9 log 3/9))
 + (2/5)·(1 + (1/log 3)·(3/6 log 3/6 + 1/6 log 1/6 + 2/6 log 2/6)) = 0.079
Since the maximum similarity is found with cluster c1, x16 is allocated to cluster c1.
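The arithmetic of Example 1 can be reproduced with a short script. The sketch below is our illustrative Python (natural logarithms; the bracketed factor in Definition 4 is base-independent), using the data of Tables 1 and 2:

```python
import math

# Sampled window S1: attribute rows as in Table 1.
S1 = {
    'A1': list('abcaacccabcccba'),
    'A2': list('mmfmmfmffmmfmmf'),
    'A3': list('cbcacaacbacbbca'),
}
# Clusters of Table 2 as index ranges: c1 = x1..x5, c2 = x6..x10, c3 = x11..x15.
clusters = {'c1': range(0, 5), 'c2': range(5, 10), 'c3': range(10, 15)}

def similarity(x, ci, data, clusters):
    """Definition 4: S_P(x, ci) with P = all attributes of x."""
    k = len(clusters)
    s = 0.0
    for a, v in x.items():
        col = data[a]
        # m_a: frequency of x's value on attribute a inside cluster ci.
        m_a = sum(col[i] == v for i in clusters[ci]) / len(clusters[ci])
        # |[x]_a^S|: sampled points sharing the value v on attribute a.
        total = sum(c == v for c in col)
        # Sum of w * log w over the clusters (0 log 0 treated as 0).
        wlogw = 0.0
        for idx in clusters.values():
            w = sum(col[i] == v for i in idx) / total
            if w > 0:
                wlogw += w * math.log(w)
        s += m_a * (1 + wlogw / math.log(k))
    return s

x16 = {'A1': 'a', 'A2': 'm', 'A3': 'c'}
sims = {c: similarity(x16, c, S1, clusters) for c in clusters}
# sims is approximately {'c1': 0.156, 'c2': 0.057, 'c3': 0.079},
# matching the paper up to rounding; the maximum is for c1.
```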
Table 3: Unlabeled data points' similarities with clusters c1, c2 and c3 and their class labels
Data point (xi) | Sp(xi, c1) | Sp(xi, c2) | Sp(xi, c3) | Class label
x16 | 0.1560 | 0.0560 | 0.0790 | C1
x17 | 0.0525 | 0.0880 | 0.0880 | C2 or C3
x18 | 0.0265 | 0.0583 | 0.0531 | C2
x19 | 0.1440 | 0.1222 | 0.0905 | C1
x20 | 0.02270 | 0.06016 | 0.04540 | C2
x21 | 0.04091 | 0.11487 | 0.09114 | C2
x22 | 0.04377 | 0.12060 | 0.09401 | C2
x23 | 0.12224 | 0.05860 | 0.06236 | C1
x24 | 0.11671 | 0.06361 | 0.04879 | C1
x25 | 0.05904 | 0.05904 | 0.05904 | C1 or C2 or C3
x26 | 0.03578 | 0.03676 | 0.02186 | C2
x27 | 0.08930 | 0.08717 | 0.06062 | C1
x28 | 0.03856 | 0.03160 | 0.03536 | C1
x29 | 0.10760 | 0.08535 | 0.06947 | C1
x30 | 0.07733 | 0.11392 | 0.11281 | C2
Similarly, we can find the similarity between each of the remaining unlabeled data points of sliding window S2 and all the clusters of the first clustering result. Table 3 shows all those similarities and the class labels of those unlabeled data points, obtained with maximum similarity, and the graph in Fig. 1 shows the similarities of the unlabeled data points of sliding window S2 with clusters c1, c2 and c3.
Obviously, x16, x19, x23, x24, x27, x28 and x29 are allocated to cluster c1, and x18, x20, x21, x22, x26 and x30 are labeled to cluster c2. But x17 can be allocated to either c2 or c3, and x25 can be allocated to any of the three clusters c1, c2 and c3. The temporal clustering result obtained from sliding window S2 is shown in Table 4.
Table 4: The temporal clustering result with three clusters c1, c2 and c3 based on sliding window S2.
c1: x16 x19 x23 x24 x27 x28 x29    c2: x18 x20 x21 x22 x26 x30    c2 or c3: x17    c1, c2 or c3: x25
A1: a a a a c b a                  A1: b b c c b c                A1: c            A1: b
A2: m f m m m m f                  A2: f f f f m f                A2: m            A2: f
A3: c c b a c b b                  A3: b a a b a c                A3: a            A3: c
Now we compare the previous clustering result with the current clustering result to decide whether concept drift has occurred or not, by using equation (1):

Concept drift = yes, if (1/k) Σi=1..k di > η; no, otherwise ----- (1)

where di = 1 if the variation of the ratio of data points in cluster ci between the two sliding windows exceeds the cluster variation threshold, i.e., | |ci(St)|/N − |ci(St+1)|/N | > ε, and di = 0 otherwise. Here we use two thresholds: η, a cluster difference threshold, and ε, a cluster variation threshold. Suppose the cluster difference threshold is set to 0.25 and the cluster variation threshold to 0.3.
In our example, the variation of the ratio of the data points between cluster c1 of S1 and c1 of S2 is |5/15 − 7/15| = 0.133 < ε. The variation between c2 of S1 and c2 of S2 is |5/15 − 6/15| = 0.067 < ε, and the variation between c3 of S1 and c3 of S2 is |5/15 − 0/15| = 0.333 > ε. Therefore, by equation (1), concept drift has occurred, because (0 + 0 + 1)/3 = 0.33 > η = 0.25.
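The drift test of equation (1) is likewise a few lines (our sketch; cluster sizes taken from the running example, with x17 and x25 left unassigned):

```python
# Equation (1): drift is declared when the fraction of clusters whose
# size ratio changed by more than eps exceeds eta.
def concept_drift(prev_sizes, curr_sizes, N, eta, eps):
    changed = [abs(p / N - c / N) > eps
               for p, c in zip(prev_sizes, curr_sizes)]
    return sum(changed) / len(changed) > eta

# S1 clusters have 5 points each; in S2, c1 has 7, c2 has 6, c3 has 0.
drift = concept_drift([5, 5, 5], [7, 6, 0], N=15, eta=0.25, eps=0.3)
# Only c3's variation (0.333) exceeds eps, and 1/3 > eta, so drift is True.
```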
III.3 Data labeling algorithm based on the rough membership function
Algorithm 1:
Input: IS = (S ∪ Q, A, V, f), where S is the sampled data set, Q is the unlabeled data set, and k is the number of clusters;
Output: each object of Q labeled with the cluster that obtains the maximum similarity.
Method:
1. Generate a partition S = {c1, c2, ..., ck} of S with respect to A by calling the corresponding categorical clustering algorithm;
2. For i = 1 to |Q|
3. For j = 1 to k
4. Calculate the similarity between the ith object and the jth cluster according to Definition 4, and label the ith object with the cluster that obtained the maximum similarity;
5. End for
6. End for
The runtime complexity of the above algorithm depends on the complexity of the clustering algorithm used for the initial clustering. Typically, a clustering algorithm takes O(knq) time, where k is the number of clusters, n is the total number of data points to cluster and q is the number of attributes. The runtime complexity of computing the similarity between an arbitrary unlabeled data point and a cluster is O(|S||P|). Therefore, the whole computational cost of the proposed algorithm is O(|S||P||Q|k).
IV. CONCLUSION
In the categorical domain, the problem of how to allocate unlabeled data points to appropriate clusters has not been fully explored in previous works. Besides, for data which changes over time, clustering this type of data not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results. In this paper, based on the rough membership function and the frequency of the node, a new similarity measure for allocating an unlabeled data point to an appropriate cluster has been defined. This similarity measure has two characteristics, one pertaining to the distribution of the node over the different clusters and the other to the probability of the node in a given cluster, so it considers both the intra-cluster similarity and the inter-cluster similarity. An algorithm based on it has been presented to remedy this previously unaddressed data labeling problem, while making allowance for time complexity.
V. ACKNOWLEDGEMENTS
We wish to thank our supervisor Dr. Vinay Babu, JNTU Hyderabad, India, and all our friends who contributed to the paper.
REFERENCES
[1]. Anil K. Jain and Richard C. Dubes. ―Algorithms for Clustering Data‖, Prentice-Hall International, 1988.
[2]. Jain A K MN Murthy and P J Flyn, ―Data Clustering: A Review,‖ ACM Computing Survey, 1999.
[3]. Kaufman L, P. Rousseuw,‖ Finding Groups in Data- An Introduction to Cluster Analysis‖, Wiley Series in Probability and
Math. Sciences, 1990.
[4]. Han,J. and Kamber,M. ―Data Mining Concepts and Techniques‖, Morgan Kaufmann, 2001.
[5]. Bradley,P.S., Usama Fayyad, and Cory Reina,‖ Scaling clustering algorithms to large databases‖, Fourth International Conference
on Knowledge Discovery and Data Mining, 1998.
[6]. Joydeep Ghosh, "Scalable clustering methods for data mining", in Nong Ye, editor, "Handbook of Data Mining", chapter 10, pp. 247–277, Lawrence Erlbaum Assoc., 2003.
[7]. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan,‖ Clustering data streams: Theory and
practice‖, IEEE Transactions on Knowledge and Data Engineering, PP.515–528, 2003.
[8]. Gibson, D., Kleinberg, J.M. and Raghavan,P. ―Clustering Categorical Data An Approach Based on Dynamical Systems‖, VLDB
pp. 3-4, pp. 222-236, 2000.
[9]. Michael R. Anderberg,‖ Cluster analysis for applications‖, Academic Press, 1973.
[10]. Chen, H.L., Chen, M.S. and Lin, S.C., "A Framework for Clustering Concept-Drifting Categorical Data", IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, 2009.
[11]. Klinkenberg, R.,‖ Using labeled and unlabeled data to learn drifting concepts‖, IJCAI-01Workshop on Learning from Temporal and
Spatial Data, pp. 16-24, 2001.
[12]. Aggarwal, C., Han, J., Wang, J. and Yu P, ―A Framework for Clustering Evolving Data Streams‖, Very Large Data Bases (VLDB),
2003.
[13]. Aggarwal, C., Wolf, J.L. , Yu, P.S. , Procopiuc, C. and Park, J.S. ―Fast Algorithms for Projected Clustering,‖, ACM SIGMOD ’99,
pp. 61-72, 1999.
[14]. Liang, J. Y., Wang, J. H., & Qian, Y. H. (2009). A new measure of uncertainty based on knowledge granulation for rough sets.
Information Sciences, 179(4), 458–470.
[15]. Shannon, C.E, ―A Mathematical Theory of Communication,‖ Bell System Technical J., 1948.
[16]. Chen, H.L., Chuang, K.T. and Chen, M.S., "Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values", IEEE International Conference on Data Mining (ICDM), 2005.
[17]. Venkateswara Reddy.H, Viswanadha Raju.S,‖ Our-NIR: Node Importance Representative for Clustering of Categorical Data‖,
International Journal of Computer Science and Technology , pp. 80-82,2011.
[18]. Venkateswara Reddy.H, Viswanadha Raju.S,‖ POur-NIR: Modified Node Importance Representative for Clustering of Categorical
Data‖, International Journal of Computer Science and Information Security, pp.146-150, 2011.
[19]. Venkateswara Reddy.H, Viswanadha Raju.S,‖ A Threshold for clustering Concept – Drifting Categorical Data‖, IEEE Computer
Society, ICMLC 2011.
[20]. Venkateswara Reddy.H, Viswanadha Raju.S, ―Clustering of Concept Drift Categorical Data Using Our-NIR Method‖, International
Journal of Computer and Electrical Engineering, Vol. 3, No. 6, December 2011.
[21]. Tian Zhang, Raghu Ramakrishnan, and Miron Livny,‖ BIRCH: An Efficient Data Clustering Method for Very Large
Databases‖,ACM SIGMOD International Conference on Management of Data,1996.
[22]. S. Guha, R. Rastogi, K. Shim. CURE,‖ An Efficient Clustering Algorithm for Large Databases‖, ACM SIGMOD International
Conference on Management of Data, pp.73-84, 1998.
[23]. Ng, R.T. Jiawei Han ―CLARANS: a method for clustering objects for spatial data mining‖, Knowledge and Data Engineering,
IEEE Transactions, 2002.
[24]. Huang, Z. and Ng, M.K, ―A Fuzzy k-Modes Algorithm for Clustering Categorical Data‖ IEEE On Fuzzy Systems, 1999.
[25]. Guha,S., Rastogi,R. and Shim, K, ―ROCK: A Robust Clustering Algorithm for Categorical Attributes‖, International Conference On
Data Eng. (ICDE), 1999.
[26]. Vapnik, V.N,‖ The nature of statistical learning theory‖, Springer,1995.
[27]. Fredrik Farnstrom, James Lewis, and Charles Elkan,‖ Scalability for clustering algorithms revisited‖, ACM SIGKDD pp.:51–57,
2000.
[28]. Ganti, V., Gehrke, J. and Ramakrishnan, R, ―CACTUS—Clustering Categorical Data Using Summaries,‖ ACM SIGKDD, 1999.
[29]. Barbara, D., Li, Y. and Couto, J. ―Coolcat: An Entropy-Based Algorithm for Categorical Clustering‖, ACM International Conf.
Information and Knowledge Management (CIKM), 2002.
[30]. Andritsos, P, Tsaparas, P, Miller R.J and Sevcik, K.C.―Limbo: Scalable Clustering of Categorical Data‖, Extending Database
Technology (EDBT), 2004.
[31]. Liang, J. Y., & Li, D. Y. (2005). Uncertainty and knowledge acquisition in information systems. Beijing, China: Science Press.
[32]. Liang, J. Y., Wang, J. H., & Qian, Y. H. (2009). A new measure of uncertainty based on knowledge granulation for rough sets.
Information Sciences, 179(4), 458–470.
[33]. Jiang, F., Sui, Y. F., & Cao, C. G. (2008). A rough set approach to outlier detection. International Journal of General Systems,
37(5), 519–536.
[34]. Gluck, M.A. and Corter, J.E. ―Information Uncertainty and the Utility of Categories‖, Cognitive Science Society, pp. 283-287,
1985.