Data-Applied.com: Clustering
Introduction
Definition: Clustering techniques apply when, rather than predicting a class, we just want the instances to be divided into natural groups.
Problem statement: Given the desired number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the value of the measurement function.
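As a concrete instance of this statement (my example, not from the original slides, assuming the measurement function is the within-cluster sum of squared Euclidean distances), the partition sought is:

$$\min_{C_1,\dots,C_K}\; \sum_{k=1}^{K} \sum_{X_i \in C_k} \lVert X_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|}\sum_{X_i \in C_k} X_i$$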
Algorithm: BIRCH
Data Applied uses the BIRCH algorithm for clustering.
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies.
BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points, trying to produce the best-quality clustering with the available resources.
Why BIRCH?
It accounts for the fact that it might not be possible to hold the whole dataset in memory, as many other algorithms require.
It minimizes I/O time by producing results from a single scan of the data.
It also provides an option for handling outliers.
Background
Given N d-dimensional data points $X_i$, for $i = 1, \dots, N$:
Centroid: $X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i$
Radius: $R = \left(\frac{1}{N}\sum_{i=1}^{N} (X_i - X_0)^2\right)^{1/2}$
Diameter: $D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)}\right)^{1/2}$
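A minimal NumPy sketch of these three statistics (the function names and sample array are mine, not from the original slides):

```python
import numpy as np

def centroid(X):
    # X0 = (1/N) * sum_i X_i
    return X.mean(axis=0)

def radius(X):
    # R = sqrt( (1/N) * sum_i ||X_i - X0||^2 )
    X0 = centroid(X)
    return np.sqrt(np.mean(np.sum((X - X0) ** 2, axis=1)))

def diameter(X):
    # D = sqrt( sum_{i,j} ||X_i - X_j||^2 / (N * (N - 1)) )
    N = len(X)
    diffs = X[:, None, :] - X[None, :, :]   # all pairwise differences
    sq = np.sum(diffs ** 2, axis=2).sum()   # sum over all (i, j) pairs
    return np.sqrt(sq / (N * (N - 1)))

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
print(centroid(X), radius(X), diameter(X))
```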
Background
We also define five distance metrics between clusters:
D0: Euclidean distance
D1: Manhattan distance
D2: average inter-cluster distance
D3: average intra-cluster distance
D4: variance-increase distance
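For example, D0 and D1 are distances between the two cluster centroids, and D2 averages squared distances over all cross-cluster pairs. A small illustrative sketch (names mine, reusing NumPy as above):

```python
import numpy as np

def d0(X1, X2):
    # D0: Euclidean distance between the two cluster centroids
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

def d1(X1, X2):
    # D1: Manhattan (city-block) distance between the two centroids
    return np.abs(X1.mean(axis=0) - X2.mean(axis=0)).sum()

def d2(X1, X2):
    # D2: average inter-cluster distance, the root of the mean squared
    # distance over all cross-cluster pairs (X_i in C1, X_j in C2)
    diffs = X1[:, None, :] - X2[None, :, :]
    return np.sqrt(np.mean(np.sum(diffs ** 2, axis=2)))

C1 = np.array([[0.0, 0.0], [1.0, 1.0]])
C2 = np.array([[4.0, 4.0], [5.0, 5.0]])
print(d0(C1, C2), d1(C1, C2), d2(C1, C2))
```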
Clustering feature
A clustering feature is a triple summarizing the information we maintain about a cluster:
CF = (N, LS, SS)
N is the number of data points in the cluster.
LS is the linear sum of the data points.
SS is the square sum of the data points.
Additivity theorem: if CF1 and CF2 are the features of two disjoint clusters, the merged cluster is represented by CF1 + CF2.
We can calculate D0, D1, ..., D4 using clustering features alone, as the sketch below illustrates.
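A minimal sketch of a clustering feature with the additivity property (class and method names are mine, not Data-Applied's):

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) summarizing a set of points."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def from_points(cls, X):
        X = np.asarray(X, dtype=float)
        # N points, linear sum (a vector), square sum (a scalar)
        return cls(len(X), X.sum(axis=0), float((X ** 2).sum()))

    def __add__(self, other):
        # Additivity: the merged cluster's feature is the entry-wise sum
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # From the definitions above: R^2 = SS/N - ||LS/N||^2
        # (max(..., 0.0) guards against floating-point round-off)
        r2 = self.SS / self.N - float(np.sum(self.centroid() ** 2))
        return np.sqrt(max(r2, 0.0))
```

Merging `CF.from_points(A) + CF.from_points(B)` yields the same feature as `CF.from_points(np.vstack([A, B]))`, which is what makes a single-scan, incremental build possible.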
CF trees
A CF tree is a height-balanced tree with two parameters: branching factor B and threshold T.
Each non-leaf node contains at most B entries of the form $[CF_i, child_i]$, where $child_i$ is a pointer to its i-th child node and $CF_i$ is the clustering feature of the subcluster represented by that child.
A leaf node contains at most L entries, each of the form $[CF_i]$. It also has two pointers, prev and next, used to chain all leaf nodes together.
The tree size is a function of T: the larger T is, the smaller the tree.
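Data-Applied's implementation is not published, but scikit-learn's Birch estimator exposes the same two knobs, so it can stand in as an illustration (my example, not Data-Applied's API):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 100 points each
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(3.0, 0.3, size=(100, 2))])

# threshold        ~ T: maximum radius of a leaf subcluster
#                       (the larger T is, the smaller the tree)
# branching_factor ~ B: maximum number of entries per node
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))  # roughly [100, 100]
```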
BIRCH Algorithm
Building the CF tree
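The original slide is a diagram that does not survive extraction. As a heavily simplified sketch of the decision BIRCH makes at a leaf (function name mine; it reuses the CF class above and ignores node splitting and the B and L limits): each incoming point is absorbed into the closest leaf subcluster if the merged radius stays within T, and otherwise starts a new subcluster.

```python
import numpy as np

def insert_point(leaf_entries, x, T):
    """Absorb point x into the closest leaf CF if its radius stays
    within threshold T; otherwise start a new subcluster."""
    x_cf = CF.from_points([x])
    if leaf_entries:
        nearest = min(leaf_entries,
                      key=lambda e: np.linalg.norm(e.centroid() - np.asarray(x)))
        merged = nearest + x_cf
        if merged.radius() <= T:
            leaf_entries[leaf_entries.index(nearest)] = merged
            return leaf_entries
    leaf_entries.append(x_cf)
    return leaf_entries

entries = []
for p in [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]:
    entries = insert_point(entries, p, T=0.5)
print(len(entries))  # 2 subclusters: one near the origin, one at (5, 5)
```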
Using the DA-API’s web interface to execute Clustering
Step 1: Selection of data
Step 2: Select Cluster
Step 3: Clustering Result
Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace.
