2. Introduction<br />Definition:<br />Clustering techniques apply when, rather than predicting a class, we simply want the instances to be divided into natural groups<br />Problem statement:<br />Given the desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function<br />
3. Algorithm: BIRCH<br />The data is clustered using the BIRCH algorithm<br />BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies<br />BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources<br />
4. Why BIRCH?<br />Takes into account that it may not be possible to hold the whole dataset in memory, as many other algorithms require<br />Minimizes I/O time by producing results from a single scan of the data<br />Also provides an option for handling outliers<br />
5. Background<br />Given N d-dimensional data points Xi, for i = 1 … N:<br />Centroid X0:<br />X0 = (Σ Xi) / N<br />Radius R (average distance from the points to the centroid):<br />R = (Σ (Xi − X0)^2 / N)^(1/2)<br />Diameter D (average pairwise distance within the cluster):<br />D = (Σ_i Σ_j (Xi − Xj)^2 / (N(N−1)))^(1/2)<br />
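The three quantities above can be computed directly; the following is a minimal NumPy sketch (the sample dataset and variable names are mine, chosen for illustration):

```python
# Sketch: centroid X0, radius R, and diameter D for a small 2-D dataset,
# following the definitions on the Background slide.
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # N = 3 sample points
N = len(X)

centroid = X.sum(axis=0) / N                        # X0 = (sum Xi) / N
radius = np.sqrt(((X - centroid) ** 2).sum() / N)   # R

# Diameter: average over all ordered pairs i != j (the i == j terms are 0)
diffs = X[:, None, :] - X[None, :, :]
diameter = np.sqrt((diffs ** 2).sum() / (N * (N - 1)))
```

Note that D sums over ordered pairs, so each unordered pair is counted twice, which the N(N−1) denominator accounts for.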
6. Background<br />We also define five distance metrics between clusters:<br />D0: Euclidean distance between centroids<br />D1: Manhattan distance between centroids<br />D2: Average inter-cluster distance<br />D3: Average intra-cluster distance<br />D4: Variance increase distance<br />
7. Clustering feature<br />A clustering feature is a triple summarizing the information we maintain about a cluster<br />CF = (N, LS, SS)<br />N is the number of data points in the cluster<br />LS is the linear sum of the data points<br />SS is the square sum of the data points<br />Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, the merged cluster is represented by CF1 + CF2<br />D0, D1, …, D4 can all be computed from clustering features<br />
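A short sketch of the CF triple and the additivity theorem (helper names are mine): merging is entry-wise addition, and the centroid and radius can be recovered from a CF without the raw points.

```python
# Sketch: CF = (N, LS, SS), additivity, and recovering centroid/radius.
import numpy as np

def make_cf(points):
    pts = np.asarray(points, dtype=float)
    return (len(pts), pts.sum(axis=0), (pts ** 2).sum())

def merge_cf(cf1, cf2):
    # Additivity theorem: CF1 + CF2 is entry-wise addition
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    n, ls, ss = cf
    # R^2 = SS/N - ||LS/N||^2, from expanding sum((Xi - X0)^2) / N
    c = ls / n
    return np.sqrt(ss / n - c @ c)
```

Because these statistics suffice for the distance metrics D0–D4, BIRCH never needs to revisit the raw points when comparing or merging subclusters.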
8. CF Trees<br />A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T<br />Each non-leaf node contains at most B entries of the form [CF_i, child_i], where child_i is a pointer to its ith child node and CF_i is the clustering feature of the subcluster represented by that child<br />A leaf node contains at most L entries, each of the form [CF_i]. It also has two pointers, prev and next, which chain all leaf nodes together<br />Each leaf entry must satisfy the threshold: the radius (or diameter) of its subcluster must be less than T<br />The tree size is a function of T: the larger T is, the smaller the tree<br />
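The role of the threshold T can be shown with a deliberately simplified sketch, not the full CF tree: a flat list of leaf subclusters, each summarized by a CF. A point is absorbed by the closest subcluster if the resulting radius stays under T; otherwise it starts a new subcluster. The real tree additionally has non-leaf routing nodes, the branching factor B, and node splits on overflow.

```python
# Simplified sketch of the leaf-level threshold test in a CF tree.
# T, the leaf list, and the helper name are assumptions for illustration.
import numpy as np

T = 1.5      # threshold on subcluster radius (assumed value)
leaves = []  # each entry: [n, ls (array), ss (float)]

def try_insert(point, T=T):
    p = np.asarray(point, dtype=float)
    best, best_d = None, np.inf
    for cf in leaves:                              # closest leaf entry
        d = np.linalg.norm(p - cf[1] / cf[0])      # distance to centroid
        if d < best_d:
            best, best_d = cf, d
    if best is not None:
        n, ls, ss = best[0] + 1, best[1] + p, best[2] + (p ** 2).sum()
        r2 = ss / n - (ls / n) @ (ls / n)
        if r2 <= T ** 2:                           # radius still under T
            best[0], best[1], best[2] = n, ls, ss  # absorb the point
            return
    leaves.append([1, p.copy(), (p ** 2).sum()])   # start a new subcluster
```

With a larger T, more points are absorbed into existing entries, so fewer leaf entries (and a smaller tree) result, matching the slide's remark.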
9. BIRCH Algorithm<br />
10. Building of CF tree<br />
11. Using the DA-API’s web interface to execute Clustering<br />
12. Step 1: Selection of data<br />
13. Step 2: Select Cluster<br />
14. Step 3: Clustering Result<br />
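The same three steps (select data, cluster, inspect results) can also be reproduced outside the DA-API web interface. A hedged sketch using scikit-learn's `Birch` estimator (an assumption: the slides' tool may use a different implementation; `threshold` plays the role of T and `branching_factor` the role of B from the CF-tree slide):

```python
# Sketch: clustering two synthetic 2-D blobs with scikit-learn's Birch.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Step 1 equivalent: select data (two well-separated blobs of 50 points)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])

# Step 2 equivalent: choose the clustering algorithm and its parameters
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)

# Step 3 equivalent: obtain the clustering result as per-point labels
labels = model.fit_predict(X)
```

Each point receives the label of the subcluster it was absorbed into, after the leaf subclusters are grouped into the requested `n_clusters`.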
15. Visit more self-help tutorials<br /><ul><li>Pick a tutorial of your choice and browse through it at your own pace.</li></ul>
16. The tutorials section is free, self-guiding and will not involve any additional support.