Presented by: Manasi C. Kadam, Sharmishtha P. Alwekar, Ganesh H. Satpute, Deepak D. Ambegaonkar, Rajesh V. Dulhani
Under the guidance of Prof. G. A. Patil and Mr. Varad Meru
Introduction
Maintaining large volumes of data is a tedious task.
Data types:
1. Structured
2. Unstructured
Introduction to Data Analysis
Data analysis is the process of extracting information from data.
Two types:
1. Exploratory or descriptive
2. Confirmatory or inferential
Clustering (a.k.a. Unsupervised Learning)
The goal is to discover the natural grouping(s) among objects.
Given n objects, find K groups based on a measure of "similarity".
Organize the data into clusters such that there is
• high intra-cluster similarity
• low inter-cluster similarity
An ideal cluster is a set of points that is compact and isolated.
Examples: the K-means algorithm, K-medoids, etc.
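The K-means procedure named above can be sketched in a few lines of pure Python. This is a minimal illustration, not the project's implementation; the 2-D toy data, K = 2, and the fixed seed are assumptions for the example:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's K-means for 2-D points (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initialization
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                  (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two compact, isolated groups -- the "ideal cluster" case from the slide.
data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(data, k=2)
```

On this toy data the two centroids converge to the means of the two groups, illustrating high intra-cluster and low inter-cluster similarity.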
Parameters for K-means
The most critical choice is K.
Typically the algorithm is run for several values of K and the most appropriate output is selected.
Different initializations can lead to different outputs.
Canopy Clustering
Traditional clustering algorithms work well when a dataset has any one of these properties:
• a large number of clusters
• high feature dimensionality
• a large number of data points
When a dataset has all three properties at once, the computation becomes expensive. This necessitates a new technique: canopy clustering.
Canopy Clustering (contd.)
Performs clustering in two stages:
1. A rough and quick stage
2. A rigorous stage
Canopy Clustering (contd.)
Rough and quick stage: uses an extremely inexpensive distance measure to divide the data into overlapping subsets called "canopies".
Rigorous stage: uses a rigorous, expensive metric; clustering is applied only within each canopy.
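The rough and quick stage is classically implemented with two thresholds T1 > T2 over a cheap metric, as in McCallum et al. A minimal sketch, assuming 2-D points, an L1 "cheap" distance, and illustrative thresholds and data:

```python
def canopies(points, t1, t2):
    """Form overlapping canopies using a cheap metric and thresholds t1 > t2."""
    remaining = list(points)          # candidate canopy centers
    result = []
    while remaining:
        center = remaining[0]         # pick any remaining candidate
        cheap = lambda p: abs(p[0] - center[0]) + abs(p[1] - center[1])
        # Loose threshold t1: every point this close joins the canopy
        # (a point may belong to several canopies).
        canopy = [p for p in points if cheap(p) < t1]
        # Tight threshold t2: points this close never seed another canopy.
        remaining = [p for p in remaining if cheap(p) >= t2]
        result.append((center, canopy))
    return result

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(canopies(data, t1=4, t2=2))     # two canopies of three points each
```

The expensive metric of the rigorous stage (e.g. full K-means distance computations) then runs only between points that share a canopy.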
Complexity
The complexity of K-means is O(nk), where n is the number of objects and k is the number of centroids.
Canopy-based K-means changes this to O(nkf²/c), where
• c is the number of canopies
• f is the average number of canopies each data point falls into
Since f is a very small number and c is comparatively large, the complexity is reduced.
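To make the reduction concrete, here is back-of-the-envelope arithmetic using the slide's formula; the values of n, k, c, and f are assumptions for illustration, not from the slides:

```python
# Plain K-means: ~n*k distance computations per iteration.
# Canopy-based K-means: ~n*k*f**2/c, per the complexity formula above.
n, k = 1_000_000, 100   # assumed: objects and centroids
c, f = 50, 2            # assumed: canopies, avg. canopies per point
plain = n * k
canopy = n * k * f ** 2 / c
print(plain, int(canopy), plain / canopy)   # speedup factor c/f**2 = 12.5
```

With these assumed numbers the work per iteration shrinks by a factor of c/f² = 12.5, which is why keeping f small and c large matters.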
Conclusion
• Implemented the K-means algorithm; verified the results in Mathematica and R
• Implemented canopy clustering; verified the results in Excel
Future Enhancements
• Learning Hadoop and MapReduce
• Parallelizing K-means with MapReduce and comparing the implementations
• Running all the K-means variants on a standard dataset
References
1. Anil K. Jain, "Data Clustering: 50 Years Beyond K-Means"
2. Andrew McCallum et al., "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching"