Data Applied: Clustering
Transcript

  • 1. Data-Applied.com: Clustering
  • 2. Introduction
    Definition: Clustering techniques apply when, rather than predicting a class, we just want the instances to be divided into natural groups.
    Problem statement: Given a desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function (a small sketch of this objective follows below).
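A minimal sketch of that objective, assuming squared Euclidean distance to each cluster's centroid as the measurement function (the helper name partition_cost is ours, not part of Data Applied):

    # Score a candidate partition by the sum of squared distances
    # from each point to its own cluster's centroid.
    from math import dist

    def partition_cost(clusters):
        """clusters: list of clusters, each a list of equal-length point tuples."""
        cost = 0.0
        for cluster in clusters:
            n, d = len(cluster), len(cluster[0])
            center = tuple(sum(p[k] for p in cluster) / n for k in range(d))
            cost += sum(dist(p, center) ** 2 for p in cluster)
        return cost

    # Two candidate 2-cluster partitions of the same four points:
    points = [(0, 0), (0, 1), (10, 10), (10, 11)]
    natural = [points[:2], points[2:]]
    mixed = [[points[0], points[2]], [points[1], points[3]]]
    assert partition_cost(natural) < partition_cost(mixed)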
  • 3. Algorithm: BIRCH
    Data Applied uses the BIRCH algorithm for clustering.
    BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
    BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best quality clustering with the available resources.
  • 4. Why BIRCH?
    It takes into account that the whole dataset may not fit in memory, as many other algorithms require.
    It minimizes I/O time by producing a result from a single scan of the data.
    It also provides an option for handling outliers.
  • 5. Background
    Given N d-dimensional data points Xi, for i = 1 to N:
    Centroid X0:
      X0 = (sum of Xi for i = 1..N) / N
    Radius R (average distance from member points to the centroid):
      R = (sum over i of (Xi - X0)^2 / N)^(1/2)
    Diameter D (average pairwise distance within the cluster):
      D = (sum over i, j of (Xi - Xj)^2 / (N(N - 1)))^(1/2)
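A direct, illustrative translation of these three definitions (plain Python; assumes points are tuples of floats, and N > 1 for the diameter):

    from math import dist, sqrt

    def centroid(X):
        n, d = len(X), len(X[0])
        return tuple(sum(x[k] for x in X) / n for k in range(d))

    def radius(X):
        # R = sqrt( sum_i |Xi - X0|^2 / N )
        x0 = centroid(X)
        return sqrt(sum(dist(x, x0) ** 2 for x in X) / len(X))

    def diameter(X):
        # D = sqrt( sum_{i,j} |Xi - Xj|^2 / (N (N - 1)) );
        # the i = j terms contribute zero, so summing over all pairs is safe.
        n = len(X)
        return sqrt(sum(dist(a, b) ** 2 for a in X for b in X) / (n * (n - 1)))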
  • 6. Background
    We also define five distance metrics between clusters (sketched below):
    D0: Euclidean distance
    D1: Manhattan distance
    D2: Average inter-cluster distance
    D3: Average intra-cluster distance
    D4: Variance increase distance
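A sketch of the first three metrics as defined in the BIRCH paper: D0 and D1 compare cluster centroids, while D2 averages squared distances over all cross-cluster pairs of points (illustrative code, not Data Applied's implementation):

    from math import sqrt

    def centroid(X):
        n = len(X)
        return tuple(sum(x[k] for x in X) / n for k in range(len(X[0])))

    def d0(c1, c2):
        # D0: Euclidean distance between the two centroids
        return sqrt(sum((a - b) ** 2 for a, b in zip(centroid(c1), centroid(c2))))

    def d1(c1, c2):
        # D1: Manhattan distance between the two centroids
        return sum(abs(a - b) for a, b in zip(centroid(c1), centroid(c2)))

    def d2(c1, c2):
        # D2: root of the mean squared distance over all cross-cluster pairs
        s = sum(sum((a - b) ** 2 for a, b in zip(x, y)) for x in c1 for y in c2)
        return sqrt(s / (len(c1) * len(c2)))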
  • 7. Clustering feature
    A clustering feature is a triple summarizing the information we maintain about a cluster:
      CF = (N, LS, SS)
    N is the number of data points in the cluster
    LS is the linear sum of the data points
    SS is the square sum of the data points
    Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, the merged cluster is represented by CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
    We can calculate D0, D1, ..., D4 using clustering features alone
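A minimal sketch of clustering features, keeping LS and SS per dimension (some formulations keep SS as a scalar; helper names are ours):

    from math import sqrt

    def make_cf(points):
        n, d = len(points), len(points[0])
        ls = [sum(p[k] for p in points) for k in range(d)]
        ss = [sum(p[k] ** 2 for p in points) for k in range(d)]
        return (n, ls, ss)

    def merge_cf(cf1, cf2):
        # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        n1, ls1, ss1 = cf1
        n2, ls2, ss2 = cf2
        return (n1 + n2, [a + b for a, b in zip(ls1, ls2)],
                [a + b for a, b in zip(ss1, ss2)])

    def cf_radius(cf):
        # Expanding the slide 5 definition gives R^2 = sum_k (SS_k/N - (LS_k/N)^2),
        # so R needs only the CF triple, never the raw points.
        n, ls, ss = cf
        return sqrt(max(0.0, sum(s / n - (l / n) ** 2 for s, l in zip(ss, ls))))

For example, merge_cf(make_cf([(0, 0), (0, 1)]), make_cf([(1, 0)])) summarizes the three points without storing them, which is what lets BIRCH cluster in a single scan.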
  • 8. CF Trees
    A CF tree is a height-balanced tree with two parameters: a branching factor B and a threshold T.
    Each non-leaf node contains at most B entries of the form [CF_i, child_i], where child_i is a pointer to its ith child node and CF_i is the clustering feature of the subcluster represented by that child.
    A leaf node contains at most L entries, each of the form [CF_i]. It also has two pointers, prev and next, which chain all leaf nodes together.
    The tree size is a function of T: the larger T is, the smaller the tree.
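A deliberately simplified, leaf-only sketch of the threshold test that drives insertion (a real CF tree also has non-leaf nodes, the branching factor B, and node splits; helper names are hypothetical): a new point is absorbed by the closest leaf entry if the merged entry's radius stays within T, otherwise it starts a new entry.

    from math import sqrt

    def point_cf(p):
        return (1, list(p), [v * v for v in p])

    def merge_cf(cf1, cf2):
        n1, ls1, ss1 = cf1
        n2, ls2, ss2 = cf2
        return (n1 + n2, [a + b for a, b in zip(ls1, ls2)],
                [a + b for a, b in zip(ss1, ss2)])

    def cf_radius(cf):
        n, ls, ss = cf
        return sqrt(max(0.0, sum(s / n - (l / n) ** 2 for s, l in zip(ss, ls))))

    def insert(entries, p, T):
        if entries:
            # D0 from the new point to each entry's centroid (LS / N)
            best = min(range(len(entries)), key=lambda i: sqrt(sum(
                (l / entries[i][0] - v) ** 2 for l, v in zip(entries[i][1], p))))
            candidate = merge_cf(entries[best], point_cf(p))
            if cf_radius(candidate) <= T:   # absorb into the closest entry
                entries[best] = candidate
                return entries
        entries.append(point_cf(p))         # otherwise start a new leaf entry
        return entries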
  • 9. BIRCH Algorithm
  • 10. Building of CF tree
  • 11. Using the DA-API’s web interface to execute Clustering
  • 12. Step 1: Selection of data
  • 13. Step 2: Select Cluster
  • 14. Step 3: Clustering Result
  • 15. Visit more self help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
  • 16. The tutorials section is free, self-guiding and will not involve any additional support.
  • 17. Visit us at www.dataminingtools.net
