Clustering

  1. Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani. Under the guidance of Prof. G. A. Patil and Mr. Varad Meru.
  2. Agenda: Introduction, Clustering, K-means clustering algorithm, Canopy clustering algorithm, Complexity, Evaluation, Conclusion, Future Enhancement, References.
  3. Introduction: Maintaining large volumes of data is a tedious task. Data types: 1. Structured 2. Unstructured.
  4. Introduction to data analysis: extracting information out of data. Two types: 1. Exploratory or descriptive 2. Confirmatory or inferential.
  5. Clustering (a.k.a. unsupervised learning): The goal is to discover the natural grouping(s) among objects. Given n objects, find K groups based on a measure of "similarity". Organize the data into clusters such that there is high intra-cluster similarity and low inter-cluster similarity. An ideal cluster is a set of points that is compact and isolated. Examples: the K-means algorithm, k-medoids, etc.
  6. (figure slide)
  7. Problems in clustering: Clusters can differ in size, shape, and density; noise may be present; a cluster is a subjective entity; automation is difficult.
  8. Clustering Algorithms: Types of clustering algorithm: 1. Hierarchical 2. Partitional. Hierarchical algorithms recursively find nested clusters; types: 1. Agglomerative 2. Divisive. Partitional algorithms find all the clusters simultaneously, e.g. K-means.
  9. K-means algorithm
  10. K-means Algorithm (contd.): The goal of K-means is to minimize the sum of squared errors over all K clusters.
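The objective above can be sketched as a minimal Lloyd-style loop. This is a hypothetical NumPy illustration, not the project's own implementation; the function name `kmeans`, the random initialization, and the convergence test are assumptions made for the sketch.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimize the sum of squared errors: assign points to their nearest
    centroid, then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points (an assumption;
    # other initializations are possible and can change the output).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid: shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old centroid if its cluster empties.
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```

On well-separated groups this converges in a few iterations; on harder data the result depends on the initialization, as the parameter slide below notes.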
  11. Flowchart
  12. Class Diagram of K-means
  13. Parameters for K-means: The most critical choice is K. Typically the algorithm is run for various values of K and the most appropriate output is selected. Different initializations can lead to different outputs.
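The selection step above could be sketched as an "elbow" scan: run K-means for several values of K, record the sum of squared errors, and pick the K where the error stops dropping sharply. The inner loop is a compact stand-in for the algorithm, and the data below are made up for illustration:

```python
import numpy as np

def sse_for_k(points, k, iters=50, seed=0):
    """Run a basic Lloyd loop and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centroids[None, :],
                                axis=2).argmin(axis=1)
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return ((points - centroids[labels]) ** 2).sum()

# Two clearly separated groups: the elbow should appear at K = 2.
data = np.array([[0., 0.], [0., 1.], [1., 0.],
                 [9., 9.], [9., 10.], [10., 9.]])
for k in (1, 2, 3):
    print(k, round(float(sse_for_k(data, k)), 2))
# The SSE drops sharply from K=1 to K=2 and then flattens.
```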
  14. Canopy Clustering: Traditional clustering algorithms work well when a dataset has only one of the following properties: a large number of clusters, high feature dimensionality, or a large number of data points. When a dataset has all three properties at once, computation becomes expensive. This necessitates a new technique: canopy clustering.
  15. Canopy Clustering (contd.): Performs clustering in two stages: 1. A rough and quick stage 2. A rigorous stage.
  16. Canopy Clustering (contd.): The rough and quick stage uses an extremely inexpensive method to divide the data into overlapping subsets called "canopies". The rigorous stage uses a rigorous and expensive metric, and clustering is applied only within each canopy.
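The rough stage could look like the following sketch, assuming Euclidean distance as the cheap metric and two hypothetical thresholds T1 > T2 (the canonical formulation of canopy clustering uses such a threshold pair); this is an illustration, not the project's code:

```python
import numpy as np

def canopies(points, t1, t2):
    """Rough stage: cover the data with overlapping subsets ("canopies").
    t1 is the loose (membership) threshold, t2 the tight (removal) one."""
    assert t1 > t2, "the loose threshold must exceed the tight one"
    remaining = list(range(len(points)))
    result = []
    while remaining:
        center = remaining[0]  # pick an arbitrary remaining point
        d = np.linalg.norm(points[remaining] - points[center], axis=1)
        # Everything within t1 of the center joins this canopy.
        members = [i for i, dist in zip(remaining, d) if dist < t1]
        result.append((center, members))
        # Points within t2 leave the candidate pool, while points between
        # t2 and t1 may still appear in other canopies (overlap).
        remaining = [i for i, dist in zip(remaining, d) if dist >= t2]
    return result
```

The expensive metric of the rigorous stage then only compares points that share a canopy.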
  17. Flowchart of Canopy Clustering
  18. Source: Ref [2]
  19. Output of K-means in Mathematica on the same dataset
  20. Output of K-means in R on the same dataset
  21. Output of K-means in Microsoft Excel on the same dataset
  22. Output of canopy clustering in Excel on the same dataset
  23. Complexity: The complexity of K-means is O(nk), where n is the number of objects and k is the number of centroids. Canopy-based K-means changes this to O(nk f^2 / c), where c is the number of canopies and f is the average number of canopies that each data point falls into. Since f is very small and c is comparatively large, the complexity is reduced.
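Plugging hypothetical numbers into the two cost expressions shows the saving; the values of n, k, f, and c below are illustrative assumptions, not measurements from the project:

```python
# Illustrative cost comparison between plain K-means, O(nk), and
# canopy-based K-means, O(nk f^2 / c). All numbers are hypothetical.
n, k = 100_000, 50    # objects and centroids
f, c = 2, 20          # avg canopies per point, total number of canopies
plain_cost = n * k                 # distance computations per pass
canopy_cost = n * k * f**2 // c    # reduced by restricting to canopies
print(plain_cost, canopy_cost)     # 5000000 1000000: a 5x reduction here
```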
  24. Conclusion: Implemented the K-means algorithm and verified the results in Mathematica and R; implemented canopy clustering and verified the results in Excel.
  25. Future Enhancement: Learning Hadoop and MapReduce; parallelizing K-means based on MapReduce and comparing the implementations; running all the K-means variants on a standard dataset.
  26. References: Anil K. Jain, "Data Clustering: 50 Years Beyond K-Means"; Andrew McCallum et al., "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching"
  27. Thank You