Current clustering techniques

A survey of current clustering techniques for big data analysis, with a detailed description of the ELM k-means algorithm.

  1. 1. A Survey of Clustering Techniques for Big Data Analysis. Guided by: Prof. Prashant G. Ahire. Presented by: Miss Poonam Kshirsagar, Roll No. 204.
  2. 2. Agenda
     • Problem Definition
     • Objective
     • Literature Survey
     • Big Data and Its Analytics Challenges
     • Cluster
     • Criterion to Benchmark Clustering Methods
     • Proposed System
     • ELM
     • ELM Feature Mapping Process
     • ELM K-means Algorithm
     • Advantages
     • Disadvantages
     • Conclusion
  3. 3. Problem Definition: Among the challenges of analyzing big data, a major one is designing and developing new clustering techniques. Cloud computing can be used for big data analysis, but many traditional algorithms cannot be applied directly in a cloud environment. Traditional algorithms also raise issues of scalability, delay in producing results, and accuracy of the results produced.
     • These issues can be addressed by clustering techniques.
  4. 4. Objectives: The objectives of the thesis are as follows:
     • To study the existing clustering techniques for analyzing big data.
     • To propose and design an efficient clustering technique for big data analysis.
  5. 5. Literature Survey:
     • Topic: A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis
       Keywords: clustering algorithms, unsupervised learning, big data
       Abstract: Highlights the set of clustering algorithms that perform best for big data.
       Authors: Adil Fahad, Najlaa Alshatri, Zahir Tari (Member, IEEE)
     • Topic: Clustering in Extreme Learning Machine Feature Space
       Keywords: ELM means
       Abstract: Given the good properties of ELM feature mapping, this paper studies the clustering problem using ELM feature mapping techniques.
       Authors: Qing He, Xin Jin, Changying Du, Fuzhen Zhuang
     • Topic: A Hybrid Approach for Efficient Clustering of Big Data
       Keywords: big data, basic K-means algorithm using MapReduce, basic DBSCAN algorithm using MapReduce
       Abstract: Presents a theoretical overview of some of the current clustering techniques used for analyzing big data.
       Author: Saurabh Arora, Department of Computer Science and Engineering, Thapar University, Patiala, India
     • Topic: A Survey of Clustering Techniques for Big Data Analysis
       Keywords: big data, clustering techniques, data mining
       Abstract: Discusses some of the current clustering techniques for big data mining.
       Authors: Saurabh Arora, Inderveer, Dept. of CS
  6. 6. What is Big Data? Big data means data that is too big, too fast, or too hard for existing tools to process.
     • Too big: petabyte-scale collections of data.
     • Too fast: data that must be processed quickly.
     • Too hard: a catch-all for data that does not fit neatly into existing processing tools.
  7. 7. (Continued) Fig.: Evolution of big data
  8. 8. (Continued) Fig.: Significant growth of big data
  9. 9. Big Data Analytics Challenges: The main challenges for big data analytics are listed below:
     • The volume of data is large and also varies, so the challenge is how to deal with it.
     • Whether analysis of all the data is required.
     • Whether all the data needs to be stored.
     • Which data points are important, and how to find them.
     • How the data can be used in the best way.
  10. 10. What is a Cluster?
     • Clustering is the division of data into groups of similar objects. Each group is called a cluster.
     • A cluster consists of objects that are similar to one another and dissimilar to objects in other groups.
     • Clustering is one of the major techniques used in data mining.
  11. 11. Criterion to Benchmark Clustering Methods:
     • Volume: refers to the large amount of data. Criteria: (i) size of the dataset, (ii) handling high dimensionality, (iii) handling outliers/noisy data.
     • Velocity: refers to the speed of processing data. Criteria: (i) complexity of the algorithm, (ii) run-time performance.
     • Variety: refers to the ability to handle different types of data. Criteria: (i) type of dataset, (ii) cluster shape.
  12. 12. Comparative Analysis of Current Clustering Techniques
     • Partition clustering techniques:
       1. K-means and variant partitioning techniques. Example: K-MCI algorithm.
       2. Other partitioning techniques. Example: cuckoo search.
     • Hierarchical clustering techniques. Examples: ACA-DTRS, FACA-DTRS.
  13. 13.
     • Density-based clustering techniques. Examples: DMM clustering algorithm, DBCURE algorithm.
     • Generic clustering techniques. Example: BIRCH algorithm.
  14. 14. Proposed System:
     • Among partitioning clustering techniques, K-means has been in use for many years.
     • Nowadays, however, ELM K-means or ELM FCM is the best suited of these methods, as it finds the best-quality clusters in less computation time.
     • ELM feature mapping is easy to implement and works well for big datasets.
  15. 15. Extreme Learning Machine
     • Fast learning speed.
     • Ease of implementation.
     • Minimal human intervention.
     • ELM tends to have better scalability.
  16. 16. ELM Feature Mapping Process: $h(x) = [h_1(x), \ldots, h_i(x), \ldots, h_L(x)]^T$ with $h_i(x) = G(a_i, b_i, x)$, where:
     1. $G(a_i, b_i, x)$ is the output of the i-th hidden node.
     2. $a_i$ is the d-dimensional weight vector between the d input nodes and the i-th hidden node.
     3. $b_i$ is the bias of the i-th hidden node.
     • ELM maps the data into the L-dimensional ELM feature space H, where L is the number of hidden nodes used in the feature mapping process.
  17. 17. (Continued) Fig.: ELM feature mapping process
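The feature mapping on the slides above can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenter's implementation: the function name `elm_feature_map`, the uniform range for the random weights and biases, and the sigmoid choice for G are all assumptions, since the slides do not specify them.

```python
import numpy as np

def elm_feature_map(X, L, seed=0):
    """Map an (m, d) data matrix into an L-dimensional ELM feature space."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Column i of A is the weight vector a_i; b[i] is the bias b_i of hidden node i.
    # Drawing both uniformly from [-1, 1] is an assumption for this sketch.
    A = rng.uniform(-1.0, 1.0, size=(d, L))
    b = rng.uniform(-1.0, 1.0, size=L)
    # Assumed activation: G(a_i, b_i, x) = sigmoid(a_i . x + b_i).
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))
```

Each row of the returned matrix is h(x) for one input object, so an (m, d) input becomes an (m, L) matrix in the feature space H.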
  18. 18. (Continued) The k-means clustering problem can be described as follows:
     • Given a set of observations $(x_1, x_2, \ldots, x_m)$, where each observation is a d-dimensional real vector,
     • k-means clustering aims to partition the m observations into k sets $S = \{S_1, \ldots, S_k\}$
     • so as to minimize the within-cluster sum of squares (WCSS): $\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of the points in $S_i$.
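As a small worked example of the objective on this slide, the sketch below computes the within-cluster sum of squares for a fixed assignment of points to clusters. The helper name `wcss` and the toy data are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def wcss(X, labels, k):
    """Within-cluster sum of squares for data X under a given labelling."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]           # points assigned to set S_i
        if len(members) == 0:
            continue                       # an empty cluster contributes nothing
        mu = members.mean(axis=0)          # mu_i, the mean of the points in S_i
        total += ((members - mu) ** 2).sum()
    return float(total)

# Toy data: two pairs of points, each pair forming one cluster.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
print(wcss(X, labels, k=2))  # prints 4.0
```

Each cluster has its mean halfway between its two points, so every point lies at squared distance 1 from its mean and the four points sum to 4.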
  19. 19. ELM k-Means algorithm
     Input: k: the number of clusters; L: the number of hidden-layer nodes; D: a data set containing m objects.
     Output: a set of k clusters.
     Method:
     1: Map the original data objects in D into the ELM feature space H using $h(x) = [h_1(x), \ldots, h_i(x), \ldots, h_L(x)]^T$;
     2: Arbitrarily choose k objects from H as the initial cluster centres;
     3: repeat
     4: (Re)assign each object to the cluster to which it is most similar, based on the mean value of the objects in the cluster;
     5: Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
     6: until there is no change in the cluster centres or the maximal iteration limit is reached.
     7: return a set of k clusters.
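The steps on this slide can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' code: the name `elm_kmeans`, the sigmoid activation, the uniform random weights, the squared Euclidean distance as the similarity measure, and the default `max_iter` are all choices made for the example.

```python
import numpy as np

def elm_kmeans(D, k, L, max_iter=100, seed=0):
    """Cluster the rows of D into k groups in an L-dimensional ELM feature space."""
    rng = np.random.default_rng(seed)
    m, d = D.shape
    # Step 1: map D into the ELM feature space H (sigmoid activation assumed).
    A = rng.uniform(-1.0, 1.0, size=(d, L))
    b = rng.uniform(-1.0, 1.0, size=L)
    H = 1.0 / (1.0 + np.exp(-(D @ A + b)))
    # Step 2: arbitrarily choose k mapped objects as the initial centres.
    centres = H[rng.choice(m, size=k, replace=False)]
    labels = np.zeros(m, dtype=int)
    for _ in range(max_iter):              # Steps 3 and 6: repeat up to the limit
        # Step 4: assign each object to its nearest centre (squared Euclidean).
        dists = ((H[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centre as the mean of its members.
        new_centres = centres.copy()
        for j in range(k):
            members = H[labels == j]
            if len(members) > 0:           # keep the old centre if a cluster empties
                new_centres[j] = members.mean(axis=0)
        # Step 6: stop once the centres no longer change.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres                 # Step 7: the k clusters

# Usage on a tiny synthetic data set of six 2-D points.
D = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
labels, centres = elm_kmeans(D, k=2, L=20)
print(labels.shape, centres.shape)
```

The distance computation materialises an m by k by L array, which is fine for a demonstration but would need batching for the large data sets the slides discuss.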
  20. 20. Advantages:
     • ELM features are easy to implement, and ELM K-means produces better results than Mercer-kernel-based methods.
     • The mapping is very intuitive and straightforward.
  21. 21. Disadvantages:
     • The number of hidden nodes should be greater than 300; otherwise performance is not optimal.
     • After studying these techniques, it is observed that new methodologies are still required for analyzing big data, as these techniques are not very efficient for analyzing real-time and online streaming data.
  22. 22. Conclusion: We have studied various clustering techniques that are currently used for analyzing big data. These recent techniques are compared on the basis of execution time and cluster quality, and their merits and demerits are presented.