# Clustering

1. Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani. Under the guidance of Prof. G. A. Patil and Mr. Varad Meru.
2. Agenda: Introduction, Clustering, the K-means clustering algorithm, the canopy clustering algorithm, Complexity, Evaluation, Conclusion, Future Enhancements, References.
3. Introduction: Maintaining large volumes of data is a tedious task. Data comes in two types: 1. Structured 2. Unstructured.
4. Introduction to Data Analysis: Data analysis extracts information from data. It comes in two types: 1. Exploratory (descriptive) 2. Confirmatory (inferential).
5. Clustering (a.k.a. unsupervised learning): The goal is to discover the natural grouping(s) among objects: given n objects, find K groups based on a measure of "similarity." Data is organized into clusters such that there is high intra-cluster similarity and low inter-cluster similarity. An ideal cluster is a set of points that is compact and isolated. Examples: K-means, K-medoids.
7. Problems in clustering: Clusters can differ in size, shape, and density; noise may be present; a cluster is a subjective entity; and the process is hard to automate.
8. Clustering Algorithms: There are two types: 1. Hierarchical 2. Partitional. Hierarchical algorithms recursively find nested clusters and come in two forms, agglomerative and divisive (an agglomerative sketch follows below). Partitional algorithms find all the clusters simultaneously, e.g., K-means.
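For contrast with the partitional K-means covered next, here is a minimal agglomerative (bottom-up hierarchical) example in Python; SciPy, the toy data, and the parameter choices are illustrative additions, not part of the original work.

```python
# Agglomerative clustering of a few 2-D points: repeatedly merge the
# closest clusters, then cut the resulting tree into 2 flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
Z = linkage(X, method="average")               # build the merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```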
9. K-means Algorithm
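For concreteness, a minimal Python sketch of the standard K-means (Lloyd's) iteration on NumPy data; the function name, random initialization, and defaults are illustrative, not the group's actual implementation.

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: assignments will no longer change
        centroids = new
    return labels, centroids
```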
10. K-means Algorithm (contd.): The goal of K-means is to minimize the sum of squared errors (SSE) over all K clusters.
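In standard notation, with clusters C_1, …, C_K and centroid (mean) μ_k for cluster C_k, that objective is

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

and each K-means iteration (assign points, then recompute means) can only decrease J, which is why the algorithm converges.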
11. Flowchart of K-means
12. Class Diagram of K-means
13. Parameters for K-means: The most critical choice is K. Typically the algorithm is run for several values of K and the most appropriate output is selected (see the sketch below). Different initializations can lead to different outputs.
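A sketch of that selection process, assuming scikit-learn's KMeans and a synthetic dataset (neither is from the slides): run the algorithm for several candidate values of K and compare the resulting SSE, looking for the value beyond which further increases stop helping.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder dataset

for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)   # inertia_ is the SSE for this K
```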
14. Canopy Clustering: Traditional clustering algorithms work well when a dataset has only one of the following properties: a large number of clusters, high feature dimensionality, or a large number of data points. When a dataset has all three properties at once, the computation becomes expensive. This necessitates a new technique: canopy clustering.
15. Canopy Clustering (contd.): Clustering is performed in two stages: 1. A rough and quick stage 2. A rigorous stage.
16. Canopy Clustering (contd.): The rough and quick stage uses an extremely inexpensive distance metric to divide the data into overlapping subsets called "canopies." The rigorous stage uses a rigorous, expensive metric, and clustering is applied only within each canopy.
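A minimal sketch of the first, canopy-forming stage (after McCallum et al. [2]), assuming 2-D NumPy points, Euclidean distance as the "cheap" metric, and made-up thresholds T1 > T2:

```python
import numpy as np

def canopies(X, t1, t2, rng=np.random.default_rng(0)):
    assert t1 > t2, "loose threshold T1 must exceed tight threshold T2"
    remaining = list(range(len(X)))   # indices still eligible as centers
    result = []
    while remaining:
        # Pick a random remaining point as the next canopy center.
        center = remaining[rng.integers(len(remaining))]
        dists = np.linalg.norm(X[remaining] - X[center], axis=1)
        # Everything within the loose threshold T1 joins this canopy.
        members = [p for p, d in zip(remaining, dists) if d < t1]
        result.append((center, members))
        # Points within the tight threshold T2 can no longer start a canopy.
        remaining = [p for p, d in zip(remaining, dists) if d >= t2]
    return result

X = np.random.default_rng(1).normal(size=(200, 2))
print(len(canopies(X, t1=1.5, t2=0.5)))  # number of canopies formed
```

The expensive second-stage metric is then computed only between points that share a canopy, which is what cuts the cost shown on the complexity slide below.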
17. Flowchart of Canopy Clustering
18. Source: Ref [2]
19. Output of K-means in Mathematica on the same dataset
20. Output of K-means in R on the same dataset
21. Output of K-means in Microsoft Excel on the same dataset
22. Output of canopy clustering in Excel on the same dataset
23. Complexity: The complexity of K-means is O(nk), where n is the number of objects and k is the number of centroids. Canopy-based K-means changes this to O(nkf²/c), where c is the number of canopies and f is the average number of canopies each data point falls into. Since f is a very small number and c is comparatively big, the complexity is reduced.
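For illustration with hypothetical numbers (not from the slides): with n = 10⁶ points, k = 100 centroids, c = 1,000 canopies, and f = 3, plain K-means performs on the order of nk = 10⁸ distance computations per iteration, while the canopy-based version performs nkf²/c = 10⁶ × 100 × 9 / 1,000 = 9 × 10⁵, roughly a hundredfold reduction.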
24. Conclusion: Implemented the K-means algorithm and verified the results in Mathematica and R; implemented canopy clustering and verified the results in Excel.
25. Future Enhancements: Learning Hadoop and MapReduce; parallelizing K-means with MapReduce and comparing the implementations; running all the K-means implementations on a standard dataset.
26. References: [1] Anil K. Jain, "Data Clustering: 50 Years Beyond K-Means." [2] Andrew McCallum et al., "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching."
27. Thank You