Data Applied: Clustering


Published in: Technology, Education

  1. Clustering
  2. Introduction
     Definition: clustering techniques apply when, rather than predicting a class, we simply want the instances to be divided into natural groups.
     Problem statement: given a desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function.
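As an illustration of this objective (a hypothetical sketch, not Data Applied's code; the names `wcss`, `assignment` are made up), the quality of a partition can be scored as the within-cluster sum of squared distances to the cluster centroids:

```python
def wcss(points, centroids, assignment):
    """Within-cluster sum of squares: for each point, the squared
    Euclidean distance to the centroid of its assigned cluster."""
    total = 0.0
    for p, c in zip(points, assignment):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[c]))
    return total

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids = [(0.05, 0.1), (5.1, 4.95)]
assignment = [0, 0, 1, 1]
print(wcss(points, centroids, assignment))  # ≈ 0.05
```

A clustering algorithm searches for the assignment (and centroids) that make this value small.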
  3. Algorithm: BIRCH
     Data Applied uses the BIRCH algorithm for clustering.
     BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
     BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources.
  4. Why BIRCH?
     It accounts for the fact that the whole dataset may not fit in memory, as many other algorithms require.
     It minimizes I/O time by producing a clustering from a single scan of the data.
     It also provides an option for handling outliers.
  5. Background
     Given N d-dimensional data points Xi, for i = 1 .. N:
     Centroid X0:
       X0 = (sum_i Xi) / N
     Radius R (average distance from the points to the centroid):
       R = sqrt( sum_i (Xi - X0)^2 / N )
     Diameter D (average pairwise distance within the cluster):
       D = sqrt( sum_i sum_j (Xi - Xj)^2 / (N(N-1)) )
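The three definitions above can be written out directly (an illustrative sketch; the helper names `centroid`, `radius`, `diameter` are hypothetical):

```python
import math

def centroid(xs):
    """X0 = (sum of points) / N, computed per dimension."""
    n, d = len(xs), len(xs[0])
    return tuple(sum(x[k] for x in xs) / n for k in range(d))

def radius(xs):
    """R = sqrt( sum_i |Xi - X0|^2 / N )."""
    x0, n = centroid(xs), len(xs)
    return math.sqrt(sum(sum((xi - ci) ** 2 for xi, ci in zip(x, x0))
                         for x in xs) / n)

def diameter(xs):
    """D = sqrt( sum over ordered pairs of |Xi - Xj|^2 / (N(N-1)) );
    the i == j terms contribute zero, so they can be included freely."""
    n = len(xs)
    total = sum(sum((a - b) ** 2 for a, b in zip(xs[i], xs[j]))
                for i in range(n) for j in range(n))
    return math.sqrt(total / (n * (n - 1)))
```

For two points (0, 0) and (2, 0), this gives centroid (1, 0), radius 1, and diameter 2.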
  6. Background
     We also define five distance metrics between clusters:
     D0: Euclidean distance
     D1: Manhattan distance
     D2: average inter-cluster distance
     D3: average intra-cluster distance
     D4: variance increase distance
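The first two metrics are simple distances between cluster centroids; a brief sketch (hypothetical helper names, with D2 through D4 omitted for brevity):

```python
import math

def d0(c1, c2):
    """D0: Euclidean distance between two centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    """D1: Manhattan (city-block) distance between two centroids."""
    return sum(abs(a - b) for a, b in zip(c1, c2))

print(d0((0, 0), (3, 4)))  # 5.0
print(d1((0, 0), (3, 4)))  # 7
```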
  7. Clustering feature
     A clustering feature is a triple summarizing the information we maintain about a cluster:
       CF = (N, LS, SS)
     N is the number of data points in the cluster.
     LS is the linear sum of the data points.
     SS is the square sum of the data points.
     Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, then the merged cluster is represented by CF1 + CF2.
     D0, D1, ..., D4 can all be calculated from clustering features.
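A short sketch of the CF triple and the additivity theorem (hypothetical names, not Data Applied's code). Note how the centroid and radius from slide 5 can be recovered from (N, LS, SS) alone, since R^2 = SS/N - |LS/N|^2:

```python
import math

def cf_from_points(xs):
    """CF = (N, LS, SS): count, per-dimension linear sum, scalar square sum."""
    d = len(xs[0])
    ls = tuple(sum(x[k] for x in xs) for k in range(d))
    ss = sum(v * v for x in xs for v in x)
    return (len(xs), ls, ss)

def cf_merge(cf1, cf2):
    """Additivity theorem: the CF of two disjoint clusters merged is CF1 + CF2."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(v / n for v in ls)

def cf_radius(cf):
    """R = sqrt(SS/N - |LS/N|^2), expanded from the radius definition."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((v / n) ** 2 for v in ls), 0.0))
```

Because of additivity, merging two subclusters never requires revisiting the raw points.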
  8. CF trees
     A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T.
     Each non-leaf node contains at most B entries of the form [CF_i, child_i], where child_i is a pointer to its i-th child node and CF_i is the clustering feature of the subcluster represented by that child.
     A leaf node contains at most L entries, each of the form [CF_i]. It also has two pointers, prev and next, which chain all leaf nodes together.
     The tree size is a function of T: the larger T is, the smaller the tree.
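To make the roles of T and L concrete, here is a minimal, illustrative sketch of a single leaf node (hypothetical code, not the full BIRCH tree; a real CF tree also maintains non-leaf nodes, node splits, and the prev/next chain): a point is absorbed by the closest entry if the merged subcluster's radius stays within T, otherwise it starts a new entry, and the leaf must split once it exceeds L entries.

```python
import math

def radius(cf):
    """R = sqrt(SS/N - |LS/N|^2) for CF = (N, LS, SS)."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((v / n) ** 2 for v in ls), 0.0))

def absorb(cf, point):
    """Add one point to a CF entry (additivity with a single-point CF)."""
    n, ls, ss = cf
    return (n + 1,
            tuple(a + b for a, b in zip(ls, point)),
            ss + sum(v * v for v in point))

def leaf_insert(leaf, point, T, L):
    """leaf: list of CF entries. Returns (leaf, needs_split)."""
    point = tuple(point)
    best_i, best_d = -1, float("inf")
    for i, (n, ls, _) in enumerate(leaf):
        d = math.dist([v / n for v in ls], point)  # distance to entry centroid
        if d < best_d:
            best_i, best_d = i, d
    if best_i >= 0:
        merged = absorb(leaf[best_i], point)
        if radius(merged) <= T:        # threshold test: absorb into entry
            leaf[best_i] = merged
            return leaf, False
    # otherwise start a new single-point entry
    leaf.append((1, point, sum(v * v for v in point)))
    return leaf, len(leaf) > L         # too many entries -> leaf must split
```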
  9. BIRCH algorithm
  10. Building the CF tree
  11. Using the DA-API's web interface to execute clustering
  12. Step 1: Selection of data
  13. Step 2: Select Cluster
  14. Step 3: Clustering Result
  15. Visit more self-help tutorials
      Pick a tutorial of your choice and browse through it at your own pace.
      The tutorials section is free, self-guiding, and does not involve any additional support.
      Visit us at