This document summarizes a research paper that proposes a new dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD) for high-dimensional streaming data. The algorithm can cluster datasets that have both high dimensionality and a streaming (continuously arriving) nature. It combines previous work on clustering algorithms for streaming data and high-dimensional data. The paper introduces the algorithm and compares it experimentally to show improvements in memory usage and runtime over other approaches for these types of datasets.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been studied extensively, and many algorithms exist for different types of clustering, but these classical algorithms cannot be applied directly to big data because of its distinct features; applying traditional techniques to large unstructured data remains a challenge. This study proposes a hybrid model for clustering big data with the well-known traditional K-means algorithm. The model consists of three phases: a Mapper phase, a Clustering phase, and a Reduce phase. The first phase uses MapReduce to split the big data into small data sets; the second applies traditional K-means to each of the split sets; the last produces the overall clustering of the complete data set. Two merge functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the more suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results demonstrate the efficiency of the proposed model in clustering big data with traditional K-means, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
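As a single-process illustration of the three-phase idea (the real model runs on a MapReduce cluster, and its Mode / Fuzzy Gaussian merge functions are replaced here by a plain k-means over the local centroids), the pipeline might be sketched as:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the cluster centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign every point to its nearest centroid
        d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def mapper_phase(data, n_splits):
    """Phase 1: split the big data set into small data sets."""
    return np.array_split(data, n_splits)

def clustering_phase(chunks, k):
    """Phase 2: run traditional k-means on each split independently."""
    return [kmeans(chunk, k) for chunk in chunks]

def reduce_phase(local_centroids, k):
    """Phase 3: merge the local results into global clusters by
    clustering the local centroids themselves (a simple stand-in for
    the paper's Mode / Fuzzy Gaussian merge functions)."""
    return kmeans(np.vstack(local_centroids), k)

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (300, 2)),
                  rng.normal(5, 0.3, (300, 2))])
rng.shuffle(data)
chunks = mapper_phase(data, 4)
global_centroids = reduce_phase(clustering_phase(chunks, 2), 2)
print(global_centroids)  # one centroid near (0, 0), one near (5, 5)
```

The reduce step here simply re-clusters centroids; the paper's merge functions weight them instead, but the data flow is the same.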
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
Two approaches are possible for distributed data mining: first, data from several sources are copied to a data warehouse and mining algorithms are applied there; second, mining is performed at the local sites and the results are aggregated. When the number of features is high, transferring the data sets to a centralized location consumes a lot of bandwidth. To avoid this, dimensionality reduction can be done at the local sites: an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. Among the several methods for dimensionality reduction, two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done on how PCA can reduce data flow across a distributed network.
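A minimal sketch of how a local site could apply PCA before shipping its data, assuming plain NumPy and synthetic data (not the paper's setup):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal
    components; returns the reduced data and the components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data yields the eigenvectors of the covariance matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components

# a hypothetical local site: 200 samples of 50 correlated features
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(200, 50))

reduced, comps = pca_reduce(X, 3)
print(X.nbytes, "->", reduced.nbytes)  # 80000 -> 4800 bytes shipped
```

The site transmits `reduced` (plus the small component matrix and mean) instead of `X`, cutting the transfer by roughly the ratio of kept components to original features.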
A Novel Approach for Clustering Big Data based on MapReduce (IJECEIAES)
Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in applications such as information retrieval, image processing, and social network analytics; it helps users understand the similarity and dissimilarity between objects, and cluster analysis makes complex, large data sets easier to understand. Various researchers have analyzed different types of clustering algorithms. K-means is the most popular partitioning-based algorithm, as it gives accurate results on numerical data, but it works well for numerical data only, whereas big data is a combination of numerical and categorical data. The K-prototype algorithm handles both numerical and categorical data by combining the distances calculated on each. With the growth of data from social networking websites, business transactions, scientific computation, and so on, there are vast collections of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized to analyze these varieties of data efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed that separates mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
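The mixed dissimilarity that K-prototype combines can be written down directly; a sketch in the style of Huang's formulation, with hypothetical records and an illustrative gamma:

```python
def kprototype_distance(a, b, num_idx, cat_idx, gamma=1.0):
    """K-prototypes dissimilarity: squared Euclidean distance on the
    numeric attributes plus gamma times the number of mismatched
    categorical attributes."""
    num = sum((a[i] - b[i]) ** 2 for i in num_idx)
    cat = sum(a[i] != b[i] for i in cat_idx)
    return num + gamma * cat

# hypothetical mixed records: (age, income, city, occupation)
x = (35, 52000.0, "pune", "engineer")
y = (33, 51000.0, "pune", "teacher")
d = kprototype_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3], gamma=0.5)
print(d)  # 1000004.5
```

The income term dominates here, which is why numeric attributes are normally standardized (and gamma tuned) before clustering.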
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval (IJECEIAES)
Data mining is an essential process for identifying patterns in large data sets through machine learning techniques and database systems. Clustering high-dimensional data is challenging due to the curse of dimensionality, and existing methods have not improved space complexity or data retrieval performance. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high-dimensional data points for effective retrieval based on user queries. A normalized spectral clustering algorithm groups similar high-dimensional data points; a vantage point tree is then constructed to index the clustered points with minimal space complexity; finally, indexed data is retrieved for a user query using a vantage-point-tree-based data retrieval algorithm. This improves the true positive rate while keeping retrieval time low. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather data sets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared with state-of-the-art works.
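The vantage-point-tree half of the technique can be illustrated with a small pure-Python index; the spectral clustering step and the paper's own retrieval algorithm are omitted:

```python
import random

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside

def build_vptree(points, dist):
    """Pick a vantage point and split the rest by the median distance to it."""
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return VPNode(vp, 0.0, None, None)
    dists = sorted(dist(vp, p) for p in rest)
    radius = dists[len(dists) // 2]
    inside = [p for p in rest if dist(vp, p) < radius]
    outside = [p for p in rest if dist(vp, p) >= radius]
    return VPNode(vp, radius,
                  build_vptree(inside, dist), build_vptree(outside, dist))

def search(node, q, tau, dist, best=None):
    """Nearest-neighbour search, pruning subtrees that cannot contain
    anything closer than the best distance tau found so far."""
    if node is None:
        return best, tau
    d = dist(q, node.point)
    if d < tau:
        best, tau = node.point, d
    if d < node.radius:
        best, tau = search(node.inside, q, tau, dist, best)
        if d + tau >= node.radius:          # ball may cross the boundary
            best, tau = search(node.outside, q, tau, dist, best)
    else:
        best, tau = search(node.outside, q, tau, dist, best)
        if d - tau <= node.radius:
            best, tau = search(node.inside, q, tau, dist, best)
    return best, tau

dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
pts = [(i, i % 7) for i in range(50)]
tree = build_vptree(pts, dist)
nearest, d = search(tree, (10.2, 3.1), float("inf"), dist)
print(nearest)  # (10, 3)
```

In the paper the tree is built per cluster, so a query only searches the index of its matching cluster.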
Experimental study of Data clustering using k-Means and modified algorithms (IJDKP)
The k-means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation, and clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents the results of an experimental study of different approaches to k-means clustering, comparing results on different data sets using the original k-means and other modified algorithms implemented in MATLAB R2009b. The results are evaluated on performance measures such as the number of iterations, the number of points misclassified, accuracy, the Silhouette validity index, and execution time.
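One of the listed measures, the Silhouette validity index, can be computed directly from its definition; a small NumPy sketch (in Python rather than the paper's MATLAB):

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette value: (b - a) / max(a, b) per point, where a is
    the mean distance to the point's own cluster and b is the smallest
    mean distance to any other cluster."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None], axis=2)
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False                     # exclude the point itself
        if not same.any():
            scores.append(0.0)              # singleton cluster
            continue
        a = d[i][same].mean()
        b = min(d[i][labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(round(silhouette(pts, labels), 3))  # close to 1: well separated
```

Values near 1 indicate tight, well-separated clusters; values near 0 or below indicate overlapping ones, which is what makes the index useful for comparing k-means variants.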
Extended PSO algorithm for improvement problems of k-means clustering algorithm (IJMIT Journal)
Clustering is an unsupervised process and one of the most common data mining techniques. Its purpose is to group similar data together, so that instances within a cluster are as similar to each other as possible and as different as possible from instances in other clusters. In this paper we focus on partitional k-means clustering, which, thanks to its ease of implementation and high speed on large data sets, remains very popular among clustering algorithms after 30 years. To address the problem of k-means settling in local optima, we propose an extended PSO algorithm named ECPSO. The new algorithm is able to escape local optima and produces the problem's optimal answer with high probability. The results show that the proposed algorithm performs better than other clustering algorithms, especially on two indices: the accuracy of clustering and the quality of clustering.
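A generic PSO-over-centroids sketch (not the ECPSO variant itself) shows how a swarm can optimize the k-means objective and move past poor configurations; all constants are illustrative:

```python
import numpy as np

def wcss(centroids, data):
    """Within-cluster sum of squares: the k-means objective."""
    d = np.linalg.norm(data[:, None] - centroids[None], axis=2)
    return (d.min(axis=1) ** 2).sum()

def pso_kmeans(data, k, particles=15, iters=60, seed=0):
    """Each particle is a full set of k centroids; standard PSO update
    with inertia w and cognitive/social pulls c1, c2."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    pos = rng.uniform(data.min(0), data.max(0), (particles, k, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([wcss(p, data) for p in pos])
    gbest = pbest[pbest_f.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([wcss(p, data) for p in pos])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(4, 0.2, (100, 2))])
centroids, f = pso_kmeans(data, 2)
print(np.sort(centroids[:, 0]))
```

Because the swarm keeps many candidate centroid sets alive at once, a particle stuck in a local optimum is pulled toward the global best rather than staying put, which is the property ECPSO builds on.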
In the machine learning community there is a trend of constructing nonlinear versions of linear algorithms through the 'kernel method', for example kernel principal component analysis, kernel Fisher discriminant analysis, support vector machines (SVMs), and recent kernel clustering algorithms. Typically, in unsupervised kernel clustering, a nonlinear mapping is first applied to map the data into a much higher-dimensional feature space, and clustering is then performed there. A drawback of these kernel clustering algorithms is that the cluster prototypes live in the high-dimensional feature space and therefore lack intuitive, clear descriptions unless an additional projection approximation from the feature space back to the data space is used, as in the existing literature. Using the kernel method, this paper proposes a novel clustering algorithm founded on the conventional fuzzy c-means algorithm (FCM), called the kernel fuzzy c-means algorithm (KFCM). The method adopts a kernel-induced metric in the data space to replace the original Euclidean norm, so the cluster prototypes still reside in the data space and the clustering results can be interpreted and reformulated in the original space. This property is exploited for clustering incomplete data. Experiments on synthetic data illustrate that KFCM clusters incomplete data more accurately and more robustly than other variants of FCM.
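The kernel-induced metric at the heart of KFCM has a closed form for a Gaussian kernel, since K(x,x) = 1; a sketch using the standard FCM membership formula with illustrative parameter values:

```python
import math

def gaussian_kernel(x, v, sigma):
    sq = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq / (sigma ** 2))

def kernel_distance_sq(x, v, sigma):
    """Kernel-induced squared distance in feature space:
    ||phi(x) - phi(v)||^2 = K(x,x) + K(v,v) - 2K(x,v) = 2(1 - K(x,v))
    for a Gaussian kernel."""
    return 2.0 * (1.0 - gaussian_kernel(x, v, sigma))

def memberships(x, centers, sigma, m=2.0):
    """Fuzzy memberships of x to each prototype, using the kernel
    metric in place of the Euclidean norm (the KFCM substitution).
    Assumes x does not coincide exactly with a prototype."""
    d = [kernel_distance_sq(x, v, sigma) for v in centers]
    return [1.0 / sum((d[i] / d[j]) ** (1.0 / (m - 1))
                      for j in range(len(d)))
            for i in range(len(d))]

centers = [(0.0, 0.0), (4.0, 4.0)]
u = memberships((0.5, 0.5), centers, sigma=2.0)
print([round(v, 3) for v in u])  # mostly belongs to the (0, 0) prototype
```

Note that the prototypes stay ordinary points in the data space; only the distance is computed "as if" in feature space, which is what keeps the results interpretable.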
K-means clustering uses an iterative procedure that is very sensitive to, and dependent upon, the initial centroids. Since the initial centroids are chosen randomly, the resulting clustering also changes from run to run. This paper tries to overcome this problem of random selection, and the resulting instability of the clusters, with a premeditated selection of initial centroids. We use the iris, abalone, and wine data sets to demonstrate that the proposed method of finding initial centroids and using them in the k-means algorithm improves clustering performance. The clustering also remains the same in every run, as the initial centroids are chosen by the premeditated method rather than at random.
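The abstract does not spell out the paper's exact selection rule, so the sketch below shows one possible premeditated rule with the key property the paper relies on: every run yields exactly the same seeds.

```python
import numpy as np

def deterministic_centroids(X, k):
    """One possible premeditated seeding rule (hypothetical, not
    necessarily the paper's method): sort points by their distance to
    the data mean and take k evenly spaced quantile points, so the
    seeds are a deterministic function of the data alone."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    order = np.argsort(d)
    idx = order[np.linspace(0, len(X) - 1, k).astype(int)]
    return X[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
c1 = deterministic_centroids(X, 3)
c2 = deterministic_centroids(X, 3)
print(np.array_equal(c1, c2))  # True: identical seeds in every run
```

Any rule with this property makes k-means reproducible; the paper's contribution is a rule that also improves cluster quality.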
Drsp: dimension reduction for similarity matching and pruning of time series ... (IJDKP)
Similarity matching and joining of time series data streams have gained a lot of relevance in today's world of large streaming data. The process finds wide application in location tracking, sensor networks, and object positioning and monitoring, to name a few. However, as the size of the data stream increases, so does the cost of retaining all the data needed for similarity matching. We develop a novel framework that addresses the following objectives. First, dimension reduction is performed in the preprocessing stage, where the large stream is segmented and reduced into a compact representation that retains all the crucial information, using a technique called Multi-level Segment Means (MSM); this reduces the space complexity of storing large time-series streams. Second, the framework incorporates an effective similarity matching technique to analyze whether new data objects are similar to the existing data stream. Finally, a pruning technique filters out pseudo data object pairs and joins only the relevant pairs. The computational cost of MSM is O(l*ni) and the cost of pruning is O(DRF*wsize*d), where DRF is the Dimension Reduction Factor. Exhaustive experimental trials show that the proposed framework is both efficient and competitive compared with earlier works.
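The segment-means reduction can be sketched in a few lines; the paper's exact multi-level scheme may differ (for example in how the levels are stored and combined), so this is only the core idea:

```python
def segment_means(series, seg_len):
    """Replace each run of seg_len consecutive values by its mean
    (one level of the reduction, in the style of piecewise
    aggregate approximation)."""
    return [sum(series[i:i + seg_len]) / len(series[i:i + seg_len])
            for i in range(0, len(series), seg_len)]

def multi_level_segment_means(series, seg_len, levels):
    """Apply the reduction repeatedly; each level shrinks the series
    by roughly a factor of seg_len."""
    out = [series]
    for _ in range(levels):
        out.append(segment_means(out[-1], seg_len))
    return out

stream = [1, 1, 1, 1, 8, 8, 8, 8, 3, 3, 3, 3]
levels = multi_level_segment_means(stream, 4, 2)
print(levels[1])  # [1.0, 8.0, 3.0]
print(levels[2])  # [4.0]
```

Coarse levels let the pruning step reject obviously dissimilar pairs cheaply before the finer levels (or the raw series) are ever compared.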
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS (csandit)
The ability to automatically mine and extract useful information from large data sets has been a common concern for organizations holding such data over the last few decades. Data on the internet is increasing rapidly, and with it the capacity to collect and store very large data sets. Existing clustering algorithms are not always efficient and accurate in solving clustering problems for large data sets, and the development of accurate and fast data classification algorithms for very large data sets is still a challenge. In this paper, various algorithms and techniques, especially an approach using a non-smooth optimization formulation of the clustering problem, are proposed for solving minimum sum-of-squares clustering problems in very large data sets. This research also develops an accurate, real-time L2-DC algorithm based on the incremental approach to solve the minimum sum-of-squares clustering problem.
Information extraction from data is one of the key requirements for data analysis, and the unsupervised nature of data leads to complex computational methods for analysis. This paper presents a density-based spatial clustering technique integrated with a one-class Support Vector Machine (SVM), a machine learning technique used here for noise reduction: a modified variant of DBSCAN called Noise Reduced DBSCAN (NRDBSCAN). Analysis of DBSCAN shows its major requirement of accurate thresholds, in the absence of which it yields suboptimal results, yet identifying accurate threshold settings is hard in practice; noise is one of the major side effects of this threshold gap. The proposed work reduces noise by integrating a machine learning classifier into the operating structure of DBSCAN. Experimental results indicate high homogeneity levels in the clustering process.
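The base algorithm that NRDBSCAN modifies can be sketched compactly; the one-class SVM step that reclassifies the noise points is the paper's contribution and is omitted here:

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise.
    These -1 points are exactly the ones NRDBSCAN hands to a one-class
    SVM for possible reclassification."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # provisional noise
            continue
        cluster += 1                    # i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point, rescued from noise
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:      # j is core too: keep expanding
                queue.extend(jn)
    return labels

dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
result = dbscan(pts, eps=2, min_pts=3, dist=dist)
print(result)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The sensitivity the abstract describes is visible here: shrinking `eps` or raising `min_pts` quickly pushes genuine cluster members into the -1 bucket, which is the "threshold gap" noise NRDBSCAN targets.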
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Scalable Rough C-Means clustering using Firefly algorithm
Abhilash Namdev and B.K. Tripathy
Significance of Embedded Systems to IoT
P. R. S. M. Lakshmi, P. Lakshmi Narayanamma and K. Santhi Sri
Cognitive Abilities, Information Literacy Knowledge and Retrieval Skills of Undergraduates: A Comparison of Public and Private Universities in Nigeria
Janet O. Adekannbi and Testimony Morenike Oluwayinka
Risk Assessment in Constructing Horseshoe Vault Tunnels using Fuzzy Technique
Erfan Shafaghat and Mostafa Yousefi Rad
Evaluating the Adoption of Deductive Database Technology in Augmenting Criminal Intelligence in Zimbabwe: Case of Zimbabwe Republic Police
Mahlangu Gilbert, Furusa Samuel Simbarashe, Chikonye Musafare and Mugoniwa Beauty
Analysis of Petrol Pumps Reachability in Anand District of Gujarat
Nidhi Arora
A H-K clustering algorithm for high dimensional data using ensemble learning (ijitcs)
Advances to traditional clustering algorithms address problems such as the curse of dimensionality and the sparsity of data with many attributes. The traditional H-K clustering algorithm can resolve the randomness and a priori choice of the initial centers in k-means clustering, but when applied to high-dimensional data it suffers from the dimensional-disaster problem owing to its high computational complexity. Advanced clustering algorithms such as subspace clustering and ensemble clustering improve performance on high-dimensional data sets from different aspects and to different extents, but each of these algorithms improves performance from a single perspective only. The objective of the proposed model is to improve the performance of traditional H-K clustering and overcome its limitations, namely high computational complexity and poor accuracy on high-dimensional data, by combining three approaches: subspace clustering and ensemble clustering together with H-K clustering.
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORI ALGORITHM FOR HANDLING VOLUMIN... (acijjournal)
Apriori is one of the key algorithms for generating frequent itemsets, and analysing frequent itemsets is a crucial step in analysing structured data and finding association relationships between items. It is an elementary foundation for supervised learning, which encompasses classifier and feature-extraction methods, and applying the algorithm is crucial to understanding the behaviour of structured data. Most structured data in the scientific domain is voluminous, and processing such data requires state-of-the-art computing machines; setting up such an infrastructure is expensive, so a distributed environment such as a clustered setup is employed for these scenarios. Apache Hadoop is one of the cluster frameworks for distributed environments that helps by distributing voluminous data across a number of nodes in the framework. This paper focuses on a map/reduce design and implementation of the Apriori algorithm for structured data analysis.
Machine learning in Dynamic Adaptive Streaming over HTTP (DASH)Eswar Publications
Recently machine learning has been introduced into the area of adaptive video streaming. This paper explores a novel taxonomy that includes six state of the art techniques of machine learning that have been applied to Dynamic Adaptive Streaming over HTTP (DASH): (1) Q-learning, (2) Reinforcement learning, (3) Regression, (4) Classification, (5) Decision Tree learning, and (6) Neural networks.
Recently graph data rises in many applications and there is need to manage such large amount of data by performing various graph operations over graphs using some graph search queries. Many approaches and algorithms serve this purpose but continuously require improvement over it in terms of stability and performance. Such approaches are less efficient when large and complex data is involved. Applications need to execute faster in order to improve overall performance of the system and need to perform many
advanced and complex operations. Shortest path estimation is one of the key search queries in many applications. Here we present a system which will find the shortest path between nodes and contribute to performance of the system with the help of different shortest path algorithms such as bidirectional search and AStar algorithm and takes a relational approach using some new standard SQL queries to solve the
problem, utilizing advantages of relational database which solves the problem efficiently.
T AXONOMY OF O PTIMIZATION A PPROACHES OF R ESOURCE B ROKERS IN D ATA G RIDSijcsit
A novel taxonomy of replica selection techniques is proposed. We studied some data grid approaches
where the selection strategies of data management is different. The aim of the study is to determine the
common concepts and observe their performance and to compare their performance with our strategy
A time efficient and accurate retrieval of range aggregate queries using fuzz...IJECEIAES
Massive growth in the big data makes difficult to analyse and retrieve the useful information from the set of available data’s. Existing approaches cannot guarantee an efficient retrieval of data from the database. In the existing work stratified sampling is used to partition the tables in terms of stratic variables. However k means clustering algorithm cannot guarantees an efficient retrieval where the choosing centroid in the large volume of data would be difficult. And less knowledge about the stratic variable might leads to the less efficient partitioning of tables. This problem is overcome in the proposed methodology by introducing the FCM clustering instead of k means clustering which can cluster the large volume of data which are similar in nature. Stratification problem is overcome by introducing the post stratification approach which will leads to efficient selection of stratic variable. This methodology leads to an efficient retrieval process in terms of user query within less time and more accuracy.
AN ENTROPIC OPTIMIZATION TECHNIQUE IN HETEROGENEOUS GRID COMPUTING USING BION...ijcsit
The wide usage of the Internet and the availability of powerful computers and high-speed networks as low cost
commodity components have a deep impact on the way we use computers today, in such a way that
these technologies facilitated the usage of multi-owner and geographically distributed resources to address
large-scale problems in many areas such as science, engineering, and commerce. The new paradigm of
Grid computing has evolved from these researches on these topics. Performance and utilization of the grid
depends on a complex and excessively dynamic procedure of optimally balancing the load among the
available nodes. In this paper, we suggest a novel two-dimensional figure of merit that depict the network
effects on load balance and fault tolerance estimation to improve the performance of the network
utilizations. The enhancement of fault tolerance is obtained by adaptively decrease replication time and
message cost. On the other hand, load balance is improved by adaptively decrease mean job response time.
Finally, analysis of Genetic Algorithm, Ant Colony Optimization, and Particle Swarm Optimization is
conducted with regards to their solutions, issues and improvements concerning load balancing in
computational grid. Consequently, a significant system utilization improvement was attained. Experimental
results eventually demonstrate that the proposed method's performance surpasses other methods.
In the present day huge amount of data is generated in every minute and transferred frequently. Although
the data is sometimes static but most commonly it is dynamic and transactional. New data that is being
generated is getting constantly added to the old/existing data. To discover the knowledge from this
incremental data, one approach is to run the algorithm repeatedly for the modified data sets which is time
consuming. Again to analyze the datasets properly, construction of efficient classifier model is necessary.
The objective of developing such a classifier is to classify unlabeled dataset into appropriate classes. The
paper proposes a dimension reduction algorithm that can be applied in dynamic environment for
generation of reduced attribute set as dynamic reduct, and an optimization algorithm which uses the
reduct and build up the corresponding classification system. The method analyzes the new dataset, when it
becomes available, and modifies the reduct accordingly to fit the entire dataset and from the entire data
set, interesting optimal classification rule sets are generated. The concepts of discernibility relation,
attribute dependency and attribute significance of Rough Set Theory are integrated for the generation of
dynamic reduct set, and optimal classification rules are selected using PSO method, which not only
reduces the complexity but also helps to achieve higher accuracy of the decision system. The proposed
method has been applied on some benchmark dataset collected from the UCI repository and dynamic
reduct is computed, and from the reduct optimal classification rules are also generated. Experimental
result shows the efficiency of the proposed method.
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high
dimensional data. Many significant subspace clustering algorithms exist, each having different
characteristics caused by the use of different techniques, assumptions, heuristics used etc. A comprehensive
classification scheme is essential which will consider all such characteristics to divide subspace clustering
approaches in various families. The algorithms belonging to same family will satisfy common
characteristics. Such a categorization will help future developers to better understand the quality criteria to
be used and similar algorithms to be used to compare results with their proposed clustering algorithms. In
this paper, we first proposed the concept of SCAF (Subspace Clustering Algorithms’ Family).
Characteristics of SCAF will be based on the classes such as cluster orientation, overlap of dimensions etc.
As an illustration, we further provided a comprehensive, systematic description and comparison of few
significant algorithms belonging to “Axis parallel, overlapping, density based” SCAF.
GET IEEE BIG DATA,JAVA ,DOTNET,ANDROID ,NS2,MATLAB,EMBEDED AT LOW COST WITH BEST QUALITY PLEASE CONTACT BELOW NUMBER
FOR MORE INFORMATION PLEASE FIND THE BELOW DETAILS:
Nexgen Technology
No :66,4th cross,Venkata nagar,
Near SBI ATM,
Puducherry.
Email Id: praveen@nexgenproject.com
Mobile: 9791938249
Telephone: 0413-2211159
www.nexgenproject.com
A new model for iris data set classification based on linear support vector m...IJECEIAES
Data mining is known as the process of detection concerning patterns from essential amounts of data. As a process of knowledge discovery. Classification is a data analysis that extracts a model which describes an important data classes. One of the outstanding classifications methods in data mining is support vector machine classification (SVM). It is capable of envisaging results and mostly effective than other classification methods. The SVM is a one technique of machine learning techniques that is well known technique, learning with supervised and have been applied perfectly to a vary problems of: regression, classification, and clustering in diverse domains such as gene expression, web text mining. In this study, we proposed a newly mode for classifying iris data set using SVM classifier and genetic algorithm to optimize c and gamma parameters of linear SVM, in addition principle components analysis (PCA) algorithm was use for features reduction.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
A fuzzy clustering algorithm for high dimensional streaming data
Journal of Information Engineering and Applications
ISSN 2224-5782 (print) ISSN 2225-0506 (online)
Vol.3, No.10, 2013
www.iiste.org
A Fuzzy Clustering Algorithm for High Dimensional Streaming Data
Diksha Upadhyay, Susheel Jain, Anurag Jain
Department of Computer Science, RITS, Bhopal, India
Email: {diksha.du31@gmail.com, jain_susheel65@yahoo.co.in, anurag.akjain@gmail.com}
Abstract
In this paper we propose a dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD). The
algorithm can be used for high-dimensional datasets that have streaming behavior. Such datasets arise in
sensor networks, web click streams, internet traffic flows, and similar sources. These data have two special
properties that separate them from other datasets: a) they have streaming behavior, and b) they have higher
dimensions. Optimized fuzzy clustering algorithms have already been proposed for datasets having streaming
behavior or higher dimensions. But, to the best of our knowledge, no optimized fuzzy clustering algorithm has
been proposed for datasets having both properties, i.e., datasets with higher dimension and also continuously
arriving streaming behavior. Experimental analysis shows that our proposed algorithm (sWFCM-HD)
improves performance in terms of both memory consumption and execution time.
Keywords: K-Means, Fuzzy C-Means, Weighted Fuzzy C-Means, Dimension Reduction, Clustering.
I. INTRODUCTION
In recent years, various sources generating continuous data streams have come into existence, such as sensor
networks, web click streams, and internet traffic; data streams have thus become an important source of data.
As a result, many researchers are focusing on them, and finding efficient data stream mining algorithms has
become an important research subject. A data stream [1] is potentially infinite, arrives at an uncertain speed,
and can be scanned only once. The processing of a data stream has to be implemented within a limited space
(memory) and under a strict time constraint. Due to this, an efficient data stream mining algorithm must
satisfy more demanding requirements.
A simple comparative survey of various dimension reduction techniques and various clustering techniques
is provided in [20]. Cluster analysis plays a very important role in the data mining field. Clustering
algorithms based on the data stream model have received extensive research attention [1], [2], [3], [4], [5].
Fuzzy C-Means (FCM) and its improvements [6], [7], as important clustering methods, have been widely
used in many areas such as data mining, pattern recognition, and machine learning. In [8] the authors
proposed a weighted fuzzy c-means (sWFCM) clustering algorithm for datasets having streaming behavior.
The effects of the high dimensionality of datasets on clustering, on solving the nearest neighbor problem,
and on indexing have been observed by various researchers in detail. In high dimensions the data become
sparse, and conventional indexing and algorithmic procedures fail from an efficiency and effectiveness
perspective.
On high-dimensional data it has been observed that quantities such as proximity measures, distance
calculations, or nearest neighbors may not be effective, and may not even be qualitatively meaningful.
Recent research results examine the dimensionality of datasets from the perspective of the distance metrics
used to measure similarity between data objects [9]. Further, high-dimensional data creates various
challenging issues for conventional clustering algorithms, which require definite solutions.
In high-dimensional data, the traditional similarity measures used in conventional clustering algorithms are
usually not meaningful. Common approaches to handling high-dimensional data are subspace clustering,
projected clustering, pattern-based clustering, and correlation clustering [10]. The presence of various
irrelevant features, or of correlation among subsets of features, heavily impacts the generation and
visualization of clusters in the full-dimensional space. The major challenge for clustering is that clusters
form in subspaces of the total feature space, and the relevant feature subspace may differ from cluster to
cluster.
K-Means is a famous, simple, and widely applicable partitional clustering technique. The space complexity
of the K-Means algorithm is O((n + k)d) and the time complexity is O(nktd), where n is the number of data
points, k is the number of clusters, d is the dimension of the data, and t is the number of iterations. In [11]
the authors proposed a technique to convert high-dimensional data into two-dimensional data, after which
the simple K-Means algorithm is applied to the transformed dataset. The intention of this modified
algorithm is to reduce the dimension of the data in order to increase the efficiency of K-Means clustering.
In this paper we propose a dimension-reduced weighted fuzzy c-means algorithm (sWFCM-HD). The
algorithm is applicable to high-dimensional datasets that have streaming behavior. An example of such
datasets is live high-definition video on the internet. These data have two special properties that separate
them from other datasets: a) they have streaming behavior, and b) they have higher dimensions. As
discussed above, optimized K-means algorithms have already been proposed for datasets having streaming
behavior as well as for datasets having higher dimensions. But, to the best of our knowledge, no optimized
K-means algorithm has been proposed for datasets having both properties, i.e., datasets with higher
dimension and also continuously arriving streaming behavior. Our work is therefore a combination of the
work done in [11] and [8]. Moreover, to the best of our knowledge, existing clustering algorithms for data
streams commonly perform hard clustering; present fuzzy clustering algorithms are not used directly on
data streams.
The rest of the paper is organized as follows. In the next section we discuss related research. Section III
provides the background required for this paper. We explain our proposed algorithm in Section IV,
experimental comparisons and analysis are given in Section V, and finally we conclude the paper in
Section VI.
II. RELATED WORK
A substantial number of data stream clustering algorithms have been presented. In [2], the STREAM
algorithm is proposed to cluster data streams. STREAM first determines the sample size. If the size of a
data chunk is larger than the sample size, a LOCALSEARCH procedure is invoked to obtain the clusters of
the chunk. Afterwards, the LOCALSEARCH procedure is applied to all the cluster centers generated in
previous iterations.
The k-means algorithm is extended and the VFKM algorithm is proposed in [3]. It guarantees that the
generated model does not differ significantly from the one that would be obtained with infinite data. A
variant of the k-means algorithm, incremental k-means, is proposed to obtain high-quality solutions. In [4]
the authors proposed a time series clustering technique that creates a hierarchy of clusters incrementally,
using the correlation between time series as the similarity measure; cluster decomposition or aggregation is
performed at each step. In [5], the CluStream algorithm is proposed to cluster evolving data streams.
CluStream's idea is to divide the clustering method into an online component, which periodically stores
complete summary statistics, and an offline component, which uses only these summary statistics. A
pyramidal time frame in combination with a micro-clustering approach is used to address the problems of
efficiently choosing, storing, and using the current statistical data for a continuous fast data stream.
For clustering large image data, a sampling-based method has been proposed in [12], where the samples are
chosen by a chi-square or divergence hypothesis test. In [13], a speed-up is obtained by randomly sampling
the data and then clustering the sample; the resulting centroids are then used to initialize clustering of the
entire data set. Two well-known techniques for dimensionality reduction are feature selection and feature
extraction. A practical approach to overcoming the problems of high-dimensional datasets in which several
features are correlated is to perform feature selection before applying any data mining task [9]. For feature
selection there are unsupervised learning techniques (PCA [14], LLE [15], ISOMAP [16]) that learn a
low-dimensional space representing the data well without reference to any specific task. Principal
Component Analysis (PCA) can be used to map the original high-dimensional datasets to a
lower-dimensional space in which the points may cluster better and the resulting clusters may be more
meaningful. Among nonlinear approaches, Sammon's mapping, multidimensional scaling, and LTSA [17]
are available. Supervised dimensionality reduction techniques (e.g., Discriminative PLVM [18]) try to
estimate a low-dimensional representation that retains sufficient information for predicting the target values
of the task. These supervised techniques presume that the latent space and/or the given data are generated
from some restricted distribution.
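As a generic illustration of the PCA mapping mentioned above (this sketch is not part of [8]'s or [11]'s method; the function name and use of the SVD are our own choices), a dataset can be projected onto its top principal components in a few lines:

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X (shape (n, d)) onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    # Right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # (n, k) low-dimensional points
```

A clustering algorithm can then be run on the (n, k) output instead of the original (n, d) data.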
Various soft computing tools are also available for feature selection and feature extraction [19]. Decision
tree induction can also be used for attribute subset selection: a decision tree is constructed from the whole
data, the attributes that do not appear in the tree are assumed to be less dominant, and the attributes that do
appear are selected as important. Unfortunately, such dimensionality reduction techniques cannot be
applied to clustering problems because they are global: they generally compute only one composite
subspace of the original data space, considering the complete set of points, in which clustering is then
performed. In contrast, the problems of local feature relevance and local feature correlation indicate that
many subspaces are needed, because each cluster may exist in a different subspace [10]. In [11] a
dimension reduction technique has been proposed that first converts high-dimensional datasets into
two-dimensional data and then applies the K-Means clustering algorithm to the resultant two-dimensional
data to increase clustering efficiency.
The difference between the above proposals and ours is that none of them tried to handle high-dimensional
datasets with streaming behavior. Both high-dimensional datasets and streaming datasets have been widely
studied before, but to the best of our knowledge no clustering technique has yet been proposed that is
dedicated to datasets having both higher dimension and streaming behavior.
III. BACKGROUND
A. FCM algorithm
Consider a data set X = {x_1, x_2, ..., x_n}. The FCM algorithm partitions X into c fuzzy clusters and finds
each cluster's center so that the cost (objective) function of the dissimilarity measure is minimized or falls
below a certain threshold. FCM computes a membership value for each datum in each cluster, as follows.

Objective function:

    J_m(U, v) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m (d_{ik})^2        (1)

U and v can be calculated as:

    u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{2/(m-1)}}        (2)

    v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}        (3)

where u_{ik} is the membership value of the k-th datum x_k in the i-th cluster, d_{ik} = ||x_k - v_i|| is the
Euclidean distance between datum x_k and the cluster centroid v_i, 1 ≤ i ≤ c, 1 ≤ k ≤ n, and the exponent
m > 1.

The FCM algorithm determines the cluster centroids v_i and the membership matrix U through iterations
using the following steps:

1. Initialize the membership matrix U, where each u_{ik} is drawn randomly from (0, 1) subject to
   \sum_{i=1}^{c} u_{ik} = 1, 1 ≤ k ≤ n.
2. Calculate the c fuzzy cluster centroids v_i, i = 1, ..., c, using Equation 3.
3. Compute the objective function according to Equation 1. Stop if the objective function is minimized or
   converges to a particular value, if its improvement over the previous iteration is below a certain
   threshold, or if the number of iterations reaches a given tolerance.
4. Compute a new U using Equation 2. Go to step 2.
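The iteration above can be sketched in Python as follows. This is a minimal illustration of Equations 1–3, not the authors' implementation; the array layout, random seed, and tolerance value are our own assumptions.

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iter=100):
    """Minimal Fuzzy C-Means: X has shape (n, d), c clusters, fuzzifier m > 1."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                      # step 1: columns of U sum to 1
    prev_J = np.inf
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)               # Eq. 3 (step 2)
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # d_ik matrix
        J = np.sum(Um * D ** 2)                                    # Eq. 1 (step 3)
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        D = np.fmax(D, 1e-12)               # guard against division by zero
        U = 1.0 / np.sum((D[:, None, :] / D[None, :, :]) ** (2 / (m - 1)),
                         axis=1)                                   # Eq. 2 (step 4)
    return V, U
```

For example, on two well-separated point clouds the returned centroids settle near the cloud centers, and each column of U remains a valid fuzzy membership distribution.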
Since FCM clusters the total data set, and a data stream may contain a very large amount of data, letting
FCM deal with a data stream directly may consume significant CPU time to converge, or result in an
intolerable number of iterations. Based on this situation, [8] proposed an alternative called the weighted
FCM algorithm (sWFCM) for data streams, as discussed in the next section.
B. Weighted FCM (swFCM)
First, divide data stream into chunks X1, X2,…………. Xs according to the reaching time of data, and the size of each
chunk is determined by main memory of the processing system, let n1, n2,………ns be the data numbers of chunks
3
4. Journal of Information Engineering and Applications
ISSN 2224-5782 (print) ISSN 2225-0506 (online)
Vol.3, No.10, 2013
www.iiste.org
X1, X2,…………. Xs respectively. Due to its stream setting, a time weight
the datum influence extent on the clustering process, and
w (t) is imposed on each data representing
tc
∫
= 1
w ( t ) dt
t0
Where
t 0 is the initial time of stream and t c is the current time.
The main idea of sWFCM is renewing the weighted clustering centers by iterations till the cost function gets a
satisfying result or the number of iteration is to a tolerance. Moreover, during the processing, we give the
singleton a constant weight as 1. The procedure is presented as follow:
1) Import the chunk X_l (1 ≤ l ≤ s).
2) Update the weights of the cluster centroids.
   • If l = 1: apply FCM to obtain the cluster centroids v_i, i = 1, …, c, and compute

       w'_i = ∑_{j=1}^{n_1} (u_ij) w_j,    1 ≤ i ≤ c

     where w_j = 1 for all 1 ≤ j ≤ n_1.
   • If l > 1:

       w'_i = ∑_{j=1}^{n_l + c} (u_ij) w_j,    1 ≤ i ≤ c

     where w_j = 1 for all c + 1 ≤ j ≤ n_l + c.
   The centroid weight w_i then updates as w_i = w'_i.
3) Update the cluster centroids:

       v_i = ( ∑_{k=1}^{n_l + c} w_k (u_ik)^m x_k ) / ( ∑_{k=1}^{n_l + c} w_k (u_ik)^m )

   where x_k ∈ { v_i | 1 ≤ i ≤ c } ∪ X_l.
4) Compute the objective function:

       J_m(U, v) = ∑_{i=1}^{c} ∑_{k=1}^{c + n_l} w_k (u_ik)^m (d_ik)^2

   Stop if the objective function is minimized or converges to a certain value, if its improvement over previous iterations is below a certain threshold, or if the iterations reach a certain tolerance value.
5) Compute a new U using Equation 2. Go to step 2.
6) If l = s then stop, else go to step 1.
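Under the stated assumptions (previous centroids carried along as weighted points, singletons weighted 1), one pass of the procedure over a chunk might look like the sketch below; the per-chunk initialization of U is our own choice, not specified above:

```python
import numpy as np

def swfcm_chunk(X_l, V, W, c, m=2.0, max_iter=50, tol=1e-5, seed=0):
    """One sWFCM pass over chunk X_l (steps 2-5 above), given previous
    centroids V (c, dim) with weights W (c,). Names are our own."""
    # Working set: previous centroids followed by the new chunk,
    # i.e. x_k in {v_i | 1 <= i <= c} U X_l; singletons weigh 1.
    P = np.vstack([V, X_l])
    w = np.concatenate([W, np.ones(len(X_l))])
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(P)))
    U /= U.sum(axis=0)
    prev_J = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 3: weighted centroid update
        V = ((w * Um) @ P) / (w * Um).sum(axis=1, keepdims=True)
        d2 = ((P[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        # Step 4: weighted objective J_m(U, v)
        J = (w * Um * d2).sum()
        if abs(prev_J - J) < tol:
            break
        prev_J = J
        # Step 5: membership update (Equation 2)
        inv = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=0)
    # Step 2 for the next chunk: w'_i = sum_j u_ij * w_j
    W_new = U @ w
    return V, W_new
```

Note that the new centroid weights sum to the total weight of the working set, since each column of U sums to 1; this is how mass accumulated from earlier chunks is preserved.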
C. Converting a high-dimensional dataset into a two-dimensional dataset
We use the technique proposed in [11] to reduce the dimension of high-dimensional datasets. In this technique, each high-dimensional datum in the dataset is converted to a two-dimensional coordinate point, so the clustering algorithm can take the converted two-dimensional dataset as input instead of the high-dimensional one. The dimension reduction technique of [11] works as follows. Let O = o_1, o_2, …, o_n be a d-dimensional dataset. To convert each d-dimensional datum o_i ∈ O into a two-dimensional coordinate point (X_i, Y_i), calculate

    X_i = (x_i0 + x_i1 + … + x_i,d−1) / d

and

    Y_i = (y_i0 + y_i1 + … + y_i,d−1) / d
For the j-th dimensional value of the i-th datum in O (i.e., o_ij), we get a coordinate point (x_ij, y_ij), where x_ij = r_ij cos θ_j and y_ij = r_ij sin θ_j; here r_ij is the value of o_ij (the value in the j-th dimension of the i-th datum), θ_j = θ_{j−1} + 360/d, and θ_0 = 0°. In other words, for each datum o_i ∈ O, 1 ≤ i ≤ n, there are d coordinate points (x_ij, y_ij), 1 ≤ j ≤ d, and with the help of these coordinate points we obtain the mean value (X_i, Y_i). Plot all n mean points on the two-dimensional plane and then apply the clustering algorithm to the plotted mean points.
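Since every step of this mapping is explicit, it can be sketched directly in a few lines; the function name, list-based data representation, and the degree-to-radian conversion are our own choices:

```python
import math

def reduce_to_2d(data):
    """Map each d-dimensional datum to a 2-D mean point (X_i, Y_i)
    by averaging polar-style coordinates, per the technique of [11]."""
    reduced = []
    for o in data:                        # o is one d-dimensional datum
        d = len(o)
        xs, ys = 0.0, 0.0
        theta = 0.0                       # theta_0 = 0 degrees
        for j in range(d):
            r = o[j]                      # r_ij: value in dimension j
            xs += r * math.cos(math.radians(theta))
            ys += r * math.sin(math.radians(theta))
            theta += 360.0 / d            # theta_j = theta_{j-1} + 360/d
        reduced.append((xs / d, ys / d))  # mean point (X_i, Y_i)
    return reduced
```

One consequence worth noting: a constant vector (all dimensions equal) maps to approximately the origin, since the d directions cancel around the circle.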
IV. OUR PROPOSED TECHNIQUE (SWFCM-HD)
The disadvantages of using high-dimensional datasets in clustering algorithms were explained in section I. A dimension reduction technique was proposed in [11] to overcome such difficulties, but if the dataset has streaming behavior, the problem remains even after converting it into a lower-dimensional dataset [8], [11]. In section I, we also explained the disadvantages of applying the FCM algorithm to a dataset with streaming behavior. We combine the dimension reduction and sWFCM techniques to propose a better fuzzy clustering algorithm for large, high-dimensional stream datasets. We call our proposed algorithm sWFCM-HD, as it applies sWFCM together with a dimension reduction technique to high-dimensional streaming datasets. The algorithm is as follows:
Algorithm: sWFCM-HD
Input: A large, high-dimensional (d-dimensional) dataset O with streaming behavior.
1) Convert the d-dimensional dataset O into a two-dimensional dataset X using the dimension reduction technique discussed in section III-C.
2) Apply the sWFCM algorithm to the converted two-dimensional dataset X. The sWFCM algorithm is discussed in section III-B.
Note that, since dataset O has streaming behavior, it is not possible to reduce the dimension of the entire dataset at once. This does not create a problem, however, because the sWFCM algorithm uses one chunk of data at a time. As described in section III-B, before applying sWFCM we divide the dataset into a number of data chunks; in a real scenario these data are streaming in nature and will not be loaded into main memory all together. Hence, the dimension reduction technique is applied on a per-chunk basis rather than to the whole dataset at once.
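Putting the two steps together, the chunk-wise driver is short. The sketch below inlines the per-chunk reduction and takes the sWFCM step as an injected callable, since the full clustering code is longer than this illustration needs; all names are hypothetical:

```python
import math

def reduce_chunk(chunk):
    """Per-chunk dimension reduction of section III-C (sketch)."""
    out = []
    for o in chunk:
        d = len(o)
        # theta_j = j * 360/d degrees; average the projected coordinates
        X = sum(v * math.cos(math.radians(j * 360.0 / d))
                for j, v in enumerate(o)) / d
        Y = sum(v * math.sin(math.radians(j * 360.0 / d))
                for j, v in enumerate(o)) / d
        out.append((X, Y))
    return out

def swfcm_hd(stream, cluster_chunk):
    """sWFCM-HD driver: reduce each arriving chunk to 2-D, then hand it
    to the sWFCM step. Clustering state (centroids, weights) is threaded
    between chunks; `cluster_chunk(points, state) -> state` is assumed."""
    state = None
    for chunk in stream:
        state = cluster_chunk(reduce_chunk(chunk), state)
    return state
```

The point of the structure is that only one chunk is ever resident: reduction and clustering both happen per chunk, which is what yields the memory behavior reported in section V.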
V. EXPERIMENTAL ANALYSIS
We take high-dimensional datasets as input and convert them into two-dimensional datasets as discussed in section III-C. After reducing the dimension of a dataset, we run sWFCM on it. Though sWFCM already exists, we use it here to cluster high-dimensional data after reducing their dimension. The experiments show that sWFCM performs better than FCM for high-dimensional datasets with streaming behavior. Our main intention is to show that combining the techniques proposed in [11] and [8] in one clustering algorithm enhances performance considerably compared with either technique alone. Recall that our proposed algorithm (sWFCM-HD) is a combination of the techniques proposed in [11] and [8] (see section IV). We use the FCM algorithm on the reduced (2D) dataset as the baseline. For the experiments we use three large, high-dimensional datasets: KDD Cup 1999, Nursery, and Letter Recognition, all available at http://archive.ics.uci.edu/ml/datasets.html. Since KDD Cup 1999 is a very large dataset, we used its first 5000 records.
A. Cluster Validity
We adopt the validity functions of [8] to compare clustering efficiency. The validity functions are based on the partition coefficient and partition entropy of U.

Partition coefficient for FCM:

    V_pc(U) = (1/n) ∑_{j=1}^{n} ∑_{i=1}^{c} u_ij^2

Partition coefficient for sWFCM:

    V_pc(U) = (1/n) ∑_{j=1}^{n} ∑_{i=1}^{c} w_i u_ij^2

Partition entropy for FCM:

    V_pe(U) = −(1/n) ∑_{j=1}^{n} ∑_{i=1}^{c} u_ij log u_ij

Partition entropy for sWFCM:

    V_pe(U) = −(1/n) ∑_{j=1}^{n} ∑_{i=1}^{c} w_i u_ij log u_ij

where n is the total number of data in the dataset, and w_i, u_ij, and U are the centroid weights, the membership values, and the membership matrix respectively (see section III for details).
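These validity indices translate directly into code. A minimal sketch, assuming U is stored as a c × n NumPy array (rows = clusters, columns = data) and w as a length-c weight vector; function names are our own:

```python
import numpy as np

def partition_coefficient(U, w=None):
    """V_pc: mean (weighted) squared membership. Higher = crisper.
    Passing w gives the sWFCM variant; omitting it gives plain FCM."""
    n = U.shape[1]
    W = np.ones(U.shape[0]) if w is None else np.asarray(w)
    return (W[:, None] * U ** 2).sum() / n

def partition_entropy(U, w=None):
    """V_pe: (weighted) membership entropy. Lower = crisper.
    Assumes all memberships are strictly positive (log of 0 is undefined)."""
    n = U.shape[1]
    W = np.ones(U.shape[0]) if w is None else np.asarray(w)
    return -(W[:, None] * U * np.log(U)).sum() / n
```

For an unweighted membership matrix, V_pc lies between 1/c (maximally fuzzy) and 1 (crisp), and V_pe is 0 exactly when the partition is crisp.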
Tables I, II, and III show cluster validity in terms of partition coefficient and partition entropy for the three datasets: Nursery, KDDCUP 1999, and Letter Recognition respectively.
B. Memory Used
Since sWFCM processes data as a number of chunks, we calculated the memory consumption of each chunk separately and took the largest value as the final memory consumption for sWFCM-HD. Since the dataset is streaming in nature, sWFCM-HD never needs to access more than one chunk at a time. Figure 1 shows the percentage improvement in memory consumption of the proposed algorithm (sWFCM-HD) over the baseline. The improvement is more than 97% for all three datasets. The baseline algorithm (FCM) uses the entire dataset at once and hence requires enough memory to hold the complete dataset; this is why the baseline requires much more memory than our proposed algorithm.
Figure 1. Percentage improvement in memory consumption of the proposed sWFCM-HD over the baseline (FCM): (a) Nursery, (b) KDD Cup 1999, (c) Letter Recognition.
C. Execution Time
As with memory consumption, we calculated the execution time for each chunk separately and took the largest value as the final execution time for our proposed algorithm. This is appropriate because sWFCM-HD processes only one chunk at a time, and there is no bound on when the next chunk will arrive. Figure 2 shows the percentage improvement of sWFCM-HD over the baseline in terms of execution time. The large improvement shown is possible because we compare the execution time of the baseline (which processes the entire dataset at once) with the largest per-chunk execution time of sWFCM-HD; the total execution time (the sum over all chunks) is not the relevant measure in a streaming setting.
Figure 2. Percentage improvement in execution time of the proposed sWFCM-HD over the baseline (FCM): (a) Nursery, (b) KDD Cup 1999, (c) Letter Recognition.
VI. CONCLUSION
Mining data from data streams is very difficult because of the limited amount of memory available and the requirement of real-time query response. Clustering is a major task in mining any input data. At the same time, high-dimensional data pose different challenges for clustering algorithms that require specialized solutions: in high-dimensional data, the traditional similarity measures used in conventional clustering algorithms are usually not meaningful. In this paper we propose a dimension-reduced weighted fuzzy clustering algorithm (sWFCM-HD). The algorithm can be used for high-dimensional datasets with streaming behavior, such as data from sensor networks, web click streams, and internet traffic. These data have two special properties that separate them from other datasets: a) they have streaming behavior, and b) they have high dimensionality. Optimized fuzzy clustering algorithms have already been proposed for datasets that have streaming behavior or high dimensionality, but to the best of our knowledge no optimized fuzzy clustering algorithm has been proposed for datasets with both properties, i.e., high dimensionality together with continuously arriving streaming behavior. Experimental analysis shows that our proposed algorithm (sWFCM-HD) improves performance in terms of both memory consumption and execution time.
REFERENCES
[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream Systems,” in
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems, ser. PODS ’02, 2002, pp. 1–16.
[2] L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-data algorithms for high-quality clustering,” in Data Engineering, 2002. Proceedings. 18th International Conference on, 2002, pp. 685–694.
[3] P. Domingos and G. Hulten, “A general method for scaling up machine learning algorithms and its application to clustering,” in Proceedings of the Eighteenth International Conference on Machine Learning, ser. ICML ’01, 2001, pp. 106–113.
[4] P. Rodrigues, J. Gama, and J. Pedroso, “Hierarchical clustering of time-series data streams,” Knowledge and
Data Engineering, IEEE Transactions on, vol. 20, no. 5, pp. 615– 627, 2008.
[5] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” in
Proceedings of the 29th international conference on Very large data bases - Volume 29, ser. VLDB ’03, 2003,
pp. 81–92.
[6] S. Eschrich, J. Ke, L. Hall, and D. Goldgof, “Fast fuzzy clustering of infrared images,” in IFSA World
Congress and 20th NAFIPS International Conference, 2001. Joint 9th, vol. 2, 2001, pp. 1145–1150 vol.2.
[7] M. B. Al-Zoubi, A. Hudaib, and B. Al-Shboul, “A fast fuzzy clustering algorithm,” in Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases - Volume 6, ser. AIKED’07, 2007, pp. 28–32.
[8] R. Wan, X. Yan, and X. Su, “A weighted fuzzy clustering algorithm for data stream,” in Proceedings of the 2008 ISECS International Colloquium on Computing, Communication, Control, and Management - Volume 01, ser. CCCM ’08, 2008, pp. 360–364.
[9] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high
dimensional spaces,” in Proceedings of the 8th International Conference on Database Theory, ser. ICDT ’01,
2001, pp. 420– 434.
[10] H.-P. Kriegel, P. Kröger, and A. Zimek, “Clustering high dimensional data: A survey on subspace clustering, pattern based clustering, and correlation clustering,” ACM Trans. Knowl. Discov. Data, vol. 3, no. 1, pp. 1:1–1:58, Mar. 2009.
[11] P. Bishnu and V. Bhattacherjee, “A dimension reduction technique for k-means clustering algorithm,” in
Recent Advances in Information Technology (RAIT), 2012 1st International Conference on, 2012, pp. 531–
535.
[12] N. R. Pal and J. C. Bezdek, “Complexity reduction for ‘large image’ processing,” Trans. Sys. Man Cyber. Part B, vol. 32, no. 5, Oct. 2002.
[13] D. Altman, “Efficient fuzzy clustering of multi-spectral images,” in Geoscience and Remote Sensing
Symposium,1999. IGARSS ’99 Proceedings. IEEE 1999 International, vol. 3, 1999, pp. 1594–1596 vol.3.
[14] I. Fodor. (2002) A Survey of Dimension Reduction Techniques. [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.5098
[15] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” SCIENCE,
vol.290, pp. 2323–2326, 2000.
[16] J. B. Tenenbaum, V. d. Silva, and J. C. Langford, “A global geometric framework for nonlinear
dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[17] L. Teng, H. Li, X. Fu, W. Chen, and I.-F. Shen, “Dimension reduction of microarray data based on Local
tangent space alignment,” in Proceedings of the Fourth IEEE International Conference on Cognitive
Informatics, ser. ICCI ’05, 2005, pp. 154–159.
[18] R. Urtasun and T. Darrell, “Discriminative gaussian process latent variable model for classification,” in Proceedings of the 24th international conference on Machine learning, ser. ICML ’07, 2007, pp. 927–934.
[19] L. Tan and Y. Zhang, “A comparative study of dimension reduction based on data distribution,” in
Intelligent Systems (GCIS), 2010 Second WRI Global Congress on, vol. 3, 2010, pp. 309–312.
[20] D. Upadhyay, S. Jain, and A. Jain, “Comparative analysis of various data stream mining procedures and various dimension reduction techniques,” International Journal of Advanced Research in Computer Science, vol. 4, no. 8, May–June 2013.