Data Applied: Clustering
Transcript

  • 1. Data-Applied.com: Clustering
  • 2. Introduction
    Definition:
    Clustering techniques apply when, rather than predicting the class, we simply want to divide the instances into natural groups
    Problem statement:
    Given a desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function (a small sketch follows below)
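    As a concrete illustration (a minimal Python sketch, not Data Applied's actual code), the measurement function below scores a candidate set of K centroids by the total distance from each point to its nearest centroid; clustering seeks the partition that minimizes this value:

        import math

        def euclidean(p, q):
            # straight-line (Euclidean) distance between two points
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

        def clustering_cost(points, centroids):
            # total distance from each point to its nearest centroid
            return sum(min(euclidean(p, c) for c in centroids) for p in points)

        points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
        print(clustering_cost(points, [(1.1, 0.9), (5.05, 4.95)]))  # K = 2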
  • 3. Algorithm: BIRCH
    Data Applied uses the BIRCH algorithm for clustering
    BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies
    BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources
  • 4. Why BIRCH?
    It takes into account that it might not be possible to hold the whole dataset in memory, as many other algorithms require
    It minimizes I/O time by producing results from a single scan of the data
    It also provides an option for handling outliers
  • 5. Background
    Given a dataset of N d-dimensional points $X_i$, for $i = 1, \dots, N$:
    Centroid $X_0$:
    $X_0 = \frac{1}{N} \sum_{i=1}^{N} X_i$
    Radius R (average distance from member points to the centroid):
    $R = \left( \frac{1}{N} \sum_{i=1}^{N} (X_i - X_0)^2 \right)^{1/2}$
    Diameter D (average pairwise distance within the cluster):
    $D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
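    A small Python sketch of these statistics (hypothetical helper code, assuming points are given as tuples of floats):

        import math

        def centroid(X):
            # X0: component-wise mean of the N points
            N = len(X)
            return [sum(x[j] for x in X) / N for j in range(len(X[0]))]

        def sq_dist(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        def radius(X):
            # R: average distance from member points to the centroid
            x0 = centroid(X)
            return math.sqrt(sum(sq_dist(x, x0) for x in X) / len(X))

        def diameter(X):
            # D: average pairwise distance within the cluster
            # (the i = j terms contribute zero, so summing over all pairs is safe)
            N = len(X)
            return math.sqrt(sum(sq_dist(xi, xj) for xi in X for xj in X) / (N * (N - 1)))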
  • 6. Background
    We also define five distance metrics between clusters (the first few are sketched below):
    D0: Euclidean distance (between centroids)
    D1: Manhattan distance (between centroids)
    D2: Average inter-cluster distance
    D3: Average intra-cluster distance
    D4: Variance increase distance
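    A hedged sketch of the first three metrics, in one common formulation (after the BIRCH paper; helper names are illustrative):

        import math

        def sq_dist(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q))

        def d0(x0_a, x0_b):
            # Euclidean distance between the two cluster centroids
            return math.sqrt(sq_dist(x0_a, x0_b))

        def d1(x0_a, x0_b):
            # Manhattan distance between the two cluster centroids
            return sum(abs(a - b) for a, b in zip(x0_a, x0_b))

        def d2(A, B):
            # average inter-cluster distance: root mean squared distance
            # over all cross-cluster point pairs
            return math.sqrt(sum(sq_dist(x, y) for x in A for y in B) / (len(A) * len(B)))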
  • 7. Clustering feature
    A clustering feature is a triple summarizing the information we maintain about a cluster:
    CF = (N, LS, SS)
    N is the number of data points in the cluster
    LS is the linear sum of the data points
    SS is the square sum of the data points
    Additivity theorem: if CF1 and CF2 summarize two disjoint clusters, the merged cluster is represented by CF1 + CF2
    We can calculate D0, D1, ..., D4 using clustering features alone (see the sketch below)
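    A minimal sketch of a clustering feature in Python (an assumption-laden illustration: points are numeric tuples, and SS is kept as a scalar sum of squared norms):

        import math

        class CF:
            def __init__(self, n, ls, ss):
                self.n = n      # N: number of points
                self.ls = ls    # LS: linear sum (d-dimensional vector)
                self.ss = ss    # SS: sum of squared norms (scalar, by assumption)

            @classmethod
            def from_point(cls, x):
                return cls(1, list(x), sum(v * v for v in x))

            def __add__(self, other):
                # additivity theorem: merging disjoint clusters is component-wise addition
                return CF(self.n + other.n,
                          [a + b for a, b in zip(self.ls, other.ls)],
                          self.ss + other.ss)

            def centroid(self):
                return [v / self.n for v in self.ls]

            def radius(self):
                # R^2 = SS/N - ||X0||^2, which follows from the slide-5 definitions
                x0 = self.centroid()
                return math.sqrt(max(self.ss / self.n - sum(v * v for v in x0), 0.0))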
  • 8. CF Trees
    A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T
    Each non-leaf node contains at most B entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its ith child node and $CF_i$ is the clustering feature of the subcluster represented by that child
    A leaf node contains at most L entries, each of the form $[CF_i]$. It also has two pointers, prev and next, which chain all leaf nodes together
    The tree size is a function of T: the larger T is, the smaller the tree
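    A hypothetical node layout for such a tree (illustrative names, not Data Applied's implementation):

        class NonLeafNode:
            def __init__(self):
                self.entries = []   # at most B (CF, child) pairs

        class LeafNode:
            def __init__(self):
                self.entries = []   # at most L CF entries
                self.prev = None    # leaf nodes are chained together
                self.next = None    # so all subclusters can be scanned in order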
  • 9. BIRCH Algorithm
  • 10. Building of CF tree
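    A hedged sketch of the core insertion step (simplified: leaf splits and tree rebuilds are omitted; it reuses the CF class, sq_dist helper, and node layout sketched above):

        def insert(node, cf, T):
            if isinstance(node, LeafNode):
                if node.entries:
                    nearest = min(node.entries,
                                  key=lambda e: sq_dist(e.centroid(), cf.centroid()))
                    merged = nearest + cf
                    if merged.radius() <= T:    # absorb into the closest subcluster
                        node.entries[node.entries.index(nearest)] = merged
                        return
                node.entries.append(cf)         # otherwise start a new subcluster
                # a full implementation splits the leaf here if it now exceeds L entries
            else:
                entry_cf, child = min(node.entries,
                                      key=lambda e: sq_dist(e[0].centroid(), cf.centroid()))
                insert(child, cf, T)
                # a full implementation also updates entry_cf on the way back up
                # and splits the node if it exceeds B entries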
  • 11. Using the DA-API’s web interface to execute Clustering
  • 12. Step 1: Selection of data
  • 13. Step 2: Select Cluster
  • 14. Step 3: Clustering Result
  • 15. Visit more self-help tutorials
    Pick a tutorial of your choice and browse through it at your own pace.
  • 16. The tutorials section is free, self-guiding and will not involve any additional support.
  • 17. Visit us at www.dataminingtools.net