# Data Applied: Clustering



1. Data-Applied.com: Clustering
2. Introduction
   - Definition: clustering techniques apply when, rather than predicting the class, we just want the instances to be divided into natural groups.
   - Problem statement: given the desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function.
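The problem statement above can be made concrete with a small sketch: score a candidate partition by a distance-based measure, here the sum of squared distances from each point to its cluster centroid (the function name and data are illustrative, not part of Data Applied's API).

```python
import numpy as np

def partition_cost(points, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    points = np.asarray(points, dtype=float)
    cost = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)          # cluster centroid
        cost += ((members - centroid) ** 2).sum()
    return cost

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = np.array([0, 0, 1, 1])   # follows the natural grouping
bad = np.array([0, 1, 0, 1])    # splits the natural groups
assert partition_cost(points, good) < partition_cost(points, bad)
```

A good partition scores low because every point sits near its own centroid; mixing the two natural groups inflates the measure.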
3. Algorithm: BIRCH
   - Data Applied uses the BIRCH algorithm for clustering.
   - BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
   - BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources.
4. Why BIRCH?
   - It accounts for the fact that the whole dataset may not fit in memory, as many other algorithms require.
   - It minimizes I/O time by producing results from a single scan of the data.
   - It also provides an option for handling outliers.
5. Background
   Given N d-dimensional data points X_i (i = 1, ..., N):
   - Centroid X0: X0 = (sum_i X_i) / N
   - Radius R (average distance from the points to the centroid): R = (sum_i (X_i - X0)^2 / N)^(1/2)
   - Diameter D (average pairwise distance within the cluster): D = (sum_{i,j} (X_i - X_j)^2 / (N(N - 1)))^(1/2)
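These three definitions translate directly into a few lines of numpy; this is a plain sketch of the formulas above, with the cluster stored as rows of X.

```python
import numpy as np

def centroid(X):
    """X0 = (sum_i X_i) / N"""
    return X.sum(axis=0) / len(X)

def radius(X):
    """R = (sum_i (X_i - X0)^2 / N)^(1/2): average distance to the centroid."""
    return np.sqrt(((X - centroid(X)) ** 2).sum() / len(X))

def diameter(X):
    """D = (sum_{i,j} (X_i - X_j)^2 / (N(N-1)))^(1/2): average pairwise distance."""
    N = len(X)
    diffs = X[:, None, :] - X[None, :, :]   # all pairwise differences
    return np.sqrt((diffs ** 2).sum() / (N * (N - 1)))

X = np.array([[0.0, 0.0], [2.0, 0.0]])      # two points, distance 2 apart
```

For the two-point example, the centroid is (1, 0), the radius is 1 (each point is one unit from the centroid), and the diameter is 2.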
6. Background (continued)
   We also define five more distance metrics:
   - D0: Euclidean distance
   - D1: Manhattan distance
   - D2: average inter-cluster distance
   - D3: average intra-cluster distance
   - D4: variance increase distance
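As a small illustration, the first two metrics can be computed between two cluster centroids (D2–D4 additionally need the clusters' point counts and sums; the function names here are illustrative).

```python
import numpy as np

def d0(c1, c2):
    """D0: Euclidean distance between two centroids."""
    return float(np.sqrt(((c1 - c2) ** 2).sum()))

def d1(c1, c2):
    """D1: Manhattan distance between two centroids."""
    return float(np.abs(c1 - c2).sum())

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
```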
7. Clustering feature
   - A clustering feature is a triple summarizing the information we maintain about a cluster: CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the square sum of the data points.
   - Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, then the merged cluster is represented by CF1 + CF2.
   - D0, D1, ..., D4 can all be computed from clustering features.
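A minimal sketch of the CF triple makes the additivity theorem tangible: merging two disjoint clusters is component-wise addition, and quantities like the centroid and radius fall out of (N, LS, SS) without revisiting the raw points (the class name is illustrative).

```python
import numpy as np

class CF:
    """Clustering feature CF = (N, LS, SS) for one cluster."""
    def __init__(self, points):
        points = np.asarray(points, dtype=float)
        self.n = len(points)              # N: number of points
        self.ls = points.sum(axis=0)      # LS: linear sum
        self.ss = (points ** 2).sum()     # SS: square sum

    def merge(self, other):
        """Additivity theorem: CF1 + CF2, component-wise."""
        merged = CF.__new__(CF)
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||X0||^2, algebraically equal to the radius
        # definition on the Background slide.
        return np.sqrt(self.ss / self.n - (self.centroid() ** 2).sum())
```

Merging the features of two disjoint clusters gives exactly the feature of their union, which is what lets BIRCH cluster incrementally.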
8. CF trees
   - A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T.
   - Each non-leaf node contains at most B entries of the form [CF_i, child_i], where child_i is a pointer to its i-th child node and CF_i is the clustering feature of the subcluster represented by that child.
   - A leaf node contains at most L entries, each of the form [CF_i]. It also has two pointers, prev and next, which chain all leaf nodes together.
   - The tree size is a function of T: the larger T is, the smaller the tree.
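For experimentation, scikit-learn ships an open-source BIRCH implementation whose parameters mirror the CF-tree parameters above: `threshold` plays the role of T and `branching_factor` the role of B. This is a generic library sketch, not the Data Applied service itself.

```python
import numpy as np
from sklearn.cluster import Birch

# Two well-separated Gaussian blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])

# threshold ~ T, branching_factor ~ B from the CF-tree definition.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
```

With a threshold much smaller than the gap between the blobs, each blob collapses into its own set of subclusters and the final labels recover the two groups.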
9. BIRCH Algorithm
10. Building of CF tree
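The CF-tree build step can be sketched in heavily simplified form (a single leaf level, no node splits; the function name is illustrative): a point joins the nearest subcluster if the enlarged subcluster's radius stays within threshold T, otherwise it starts a new subcluster.

```python
import numpy as np

def insert_point(subclusters, x, T):
    """subclusters: list of CF triples (n, ls, ss); x: 1-D point array."""
    best, best_d = None, np.inf
    for i, (n, ls, ss) in enumerate(subclusters):
        d = np.linalg.norm(ls / n - x)           # distance to centroid
        if d < best_d:
            best, best_d = i, d
    if best is not None:
        n, ls, ss = subclusters[best]
        n2, ls2, ss2 = n + 1, ls + x, ss + (x ** 2).sum()
        r2 = ss2 / n2 - ((ls2 / n2) ** 2).sum()  # radius^2 from CF alone
        if r2 <= T ** 2:                         # absorb: stays within T
            subclusters[best] = (n2, ls2, ss2)
            return
    subclusters.append((1, x.astype(float), (x ** 2).sum()))

leaf = []
for p in ([0.0, 0.0], [0.1, 0.0], [5.0, 5.0]):
    insert_point(leaf, np.array(p), T=0.5)
```

The first two nearby points merge into one subcluster; the distant third point would blow the radius past T, so it opens a second subcluster. The full algorithm additionally splits nodes that exceed B entries and rebuilds the tree with a larger T when memory runs out.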
11. Using the DA-API's web interface to execute Clustering
12. Step 1: Selection of data
13. Step 2: Select Cluster
14. Step 3: Clustering Result
15. Visit more self-help tutorials
   - Pick a tutorial of your choice and browse through it at your own pace.
   - The tutorials section is free, self-guiding, and does not involve any additional support.
   - Visit us at www.dataminingtools.net