Data-Applied.com: Clustering
Introduction
Definition: Clustering techniques apply when, rather than predicting a class, we just want the instances to be divided into natural groups.
Problem statement: Given the desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function.
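As a concrete instance of this statement (my example, not from the original slides, assuming the measurement function is the within-cluster sum of squared Euclidean distances), the partition sought is:

$$\min_{C_1,\dots,C_K}\; \sum_{k=1}^{K} \sum_{X_i \in C_k} \lVert X_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|}\sum_{X_i \in C_k} X_i$$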
Algorithm: BIRCH
Data Applied uses the BIRCH algorithm for clustering.
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources.
Why BIRCH?
It accounts for the fact that it might not be possible to hold the whole dataset in memory, as many other algorithms require.
It minimizes I/O time by producing results from a single scan of the data.
It also provides an option for handling outliers.
Background
Given N d-dimensional data points $X_i$, for $i = 1, \dots, N$:
Centroid: $X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i$
Radius: $R = \left(\frac{1}{N}\sum_{i=1}^{N} (X_i - X_0)^2\right)^{1/2}$
Diameter: $D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)}\right)^{1/2}$
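A minimal NumPy sketch of these three statistics (the function names and sample array are mine, not from the original slides):

```python
import numpy as np

def centroid(X):
    # X0 = (1/N) * sum_i X_i
    return X.mean(axis=0)

def radius(X):
    # R = sqrt( (1/N) * sum_i ||X_i - X0||^2 )
    X0 = centroid(X)
    return np.sqrt(np.mean(np.sum((X - X0) ** 2, axis=1)))

def diameter(X):
    # D = sqrt( sum_{i,j} ||X_i - X_j||^2 / (N * (N - 1)) )
    N = len(X)
    diffs = X[:, None, :] - X[None, :, :]   # all pairwise differences
    sq = np.sum(diffs ** 2, axis=2).sum()   # sum over all (i, j) pairs
    return np.sqrt(sq / (N * (N - 1)))

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
print(centroid(X), radius(X), diameter(X))
```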
Background
We also define five distance metrics between clusters:
D0: Euclidean distance
D1: Manhattan distance
D2: average inter-cluster distance
D3: average intra-cluster distance
D4: variance-increase distance
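For example, D0 and D1 are distances between the two cluster centroids, and D2 averages squared distances over all cross-cluster pairs. A small illustrative sketch (names mine, reusing NumPy as above):

```python
import numpy as np

def d0(X1, X2):
    # D0: Euclidean distance between the two cluster centroids
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

def d1(X1, X2):
    # D1: Manhattan (city-block) distance between the two centroids
    return np.abs(X1.mean(axis=0) - X2.mean(axis=0)).sum()

def d2(X1, X2):
    # D2: average inter-cluster distance, the root of the mean squared
    # distance over all cross-cluster pairs (X_i in C1, X_j in C2)
    diffs = X1[:, None, :] - X2[None, :, :]
    return np.sqrt(np.mean(np.sum(diffs ** 2, axis=2)))

C1 = np.array([[0.0, 0.0], [1.0, 1.0]])
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])
print(d0(C1, C2), d1(C1, C2), d2(C1, C2))
```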
Clustering feature
A clustering feature is a triple summarizing the information we maintain about a cluster:
CF = (N, LS, SS)
N is the number of data points in the cluster.
LS is the linear sum of the data points.
SS is the square sum of the data points.
Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, the merged cluster is represented by CF1 + CF2.
We can calculate D0, D1, ..., D4 using clustering features alone, as the sketch below illustrates.
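A minimal sketch of a clustering feature with the additivity property (class and method names are mine, not Data-Applied's):

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) summarizing a set of points."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_points(cls, X):
        X = np.asarray(X, dtype=float)
        # N points, linear sum (a vector), square sum (a scalar)
        return cls(len(X), X.sum(axis=0), float((X ** 2).sum()))

    def __add__(self, other):
        # Additivity: the merged cluster's feature is the entry-wise sum
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # From the definitions above: R^2 = SS/N - ||LS/N||^2
        # (max(..., 0.0) guards against floating-point round-off)
        r2 = self.SS / self.N - float(np.sum(self.centroid() ** 2))
        return np.sqrt(max(r2, 0.0))
```

Merging `CF.from_points(A) + CF.from_points(B)` yields the same feature as `CF.from_points(np.vstack([A, B]))`, which is what makes a single-scan, incremental build possible.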
CF trees
A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T.
Each non-leaf node contains at most B entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its i-th child node and $CF_i$ is the clustering feature of the subcluster represented by that child.
A leaf node contains at most L entries, each of the form $[CF_i]$. It also has two pointers, prev and next, used to chain all leaf nodes together.
The tree size is a function of T: the larger T is, the smaller the tree.
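Data-Applied's implementation is not published, but scikit-learn's Birch estimator exposes the same two knobs, so it can stand in as an illustration (my example, not Data-Applied's API):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 100 points each
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])

# threshold        ~ T: maximum radius of a leaf subcluster
#                       (the larger T is, the smaller the tree)
# branching_factor ~ B: maximum number of entries per node
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))  # roughly [100, 100]
```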
BIRCH Algorithm
Building the CF tree
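The original slide is a diagram that does not survive extraction. As a heavily simplified sketch of the decision BIRCH makes at a leaf (function name mine; it reuses the CF class above and ignores node splitting and the B and L limits): each incoming point is absorbed into the closest leaf subcluster if the merged radius stays within T, and otherwise starts a new subcluster.

```python
import numpy as np

def insert_point(leaf_entries, x, T):
    """Absorb point x into the closest leaf CF if its radius stays
    within threshold T; otherwise start a new subcluster."""
    x_cf = CF.from_points([x])
    if leaf_entries:
        nearest = min(leaf_entries,
                      key=lambda e: np.linalg.norm(e.centroid() - np.asarray(x)))
        merged = nearest + x_cf
        if merged.radius() <= T:
            leaf_entries[leaf_entries.index(nearest)] = merged
            return leaf_entries
    leaf_entries.append(x_cf)
    return leaf_entries

entries = []
for p in [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]:
    entries = insert_point(entries, p, T=0.5)
print(len(entries))  # 2 subclusters: one near the origin, one at (5, 5)
```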
Using the DA-API’s web interface to execute Clustering
Step 1: Selection of data
Step 2: Select Cluster
Step 3: Clustering Result
Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace.
