BIRCH Algorithm



  1. BIRCH: An Efficient Data Clustering Method for Very Large Databases
     Tian Zhang, Raghu Ramakrishnan, Miron Livny
     CPSC 504. Presenter: Joel Lanir. Discussion: Dan Li

     Outline
     • What is data clustering
     • Data clustering applications
     • Previous approaches
     • BIRCH's goal
     • Clustering Feature
     • BIRCH clustering algorithm
     • Clustering example

     What is Data Clustering?
     • A cluster is a closely packed group: a collection of data objects that are
       similar to one another and treated collectively as a group.
     • Data clustering is the partitioning of a dataset into clusters.

     Data Clustering
     • Helps understand the natural grouping or structure in a dataset.
     • Targets large sets of multidimensional data; the data space is usually not
       uniformly occupied.
     • Identifies the sparse and the crowded regions.
     • Helps visualization.

     Discussion
     • Can you give some examples of very large databases? What applications can
       you imagine that require clustering at such a scale?
     • What special requirements do "large" databases pose on clustering, or more
       generally on data mining?

     Some Clustering Applications
     • Biology – building groups of genes with related patterns.
     • Marketing – partitioning the population of consumers into market segments.
     • Web – division of WWW pages into genres.
     • Image segmentation – for object recognition.
     • Land use – identification of areas of similar land use from satellite images.
     • Insurance – identifying groups of policy holders with a high average claim cost.
  2. Data Clustering – Previous Approaches
     Distance based (statistics):
     • There must be a distance metric between two items.
     • Assumes that all data points are in memory and can be scanned frequently.
     • Ignores the fact that not all data points are equally important.
     • Close data points are not gathered together.
     • Inspects all data points over multiple iterations, which is expensive.
     Probability based (machine learning):
     • Makes the wrong assumption that distributions on attributes are independent
       of each other.
     • Probability representations of clusters are expensive.
     These approaches do not deal with dataset and memory size issues!

     Clustering Parameters
     • Centroid – the Euclidean center of the cluster.
     • Radius – the average distance from member points to the centroid.
     • Diameter – the average pairwise distance within the cluster.
     Radius and diameter measure the tightness of a cluster around its center; we
     wish to keep them low.

     Clustering Parameters (continued)
     • Other measurements (like the Euclidean distance between the centroids of
       two clusters) measure how far apart two clusters are.
     • A good-quality clustering produces high intra-cluster similarity and low
       inter-cluster similarity, and can help find hidden patterns.

     BIRCH's Goals
     • Minimize running time and data scans, thus formulating the clustering
       problem for very large databases.
     • Make clustering decisions without scanning the whole dataset.
     • Exploit the non-uniformity of the data – treat dense areas as one unit and
       remove outliers (noise).

     Clustering Feature (CF)
     • A CF is a compact summary of the data points in a cluster.
     • It holds enough information to calculate intra-cluster distances.
     • The additivity theorem allows us to merge sub-clusters.
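     The centroid, radius, and diameter above are straightforward to compute from
     raw points. Below is a minimal sketch (not from the slides), using NumPy,
     with radius and diameter in the root-mean-square form used by the BIRCH
     paper:

```python
import numpy as np

def centroid(X):
    """Euclidean center: the mean of the cluster's points."""
    return X.mean(axis=0)

def radius(X):
    """RMS distance from the member points to the centroid."""
    c = centroid(X)
    return np.sqrt(((X - c) ** 2).sum(axis=1).mean())

def diameter(X):
    """RMS pairwise distance between points in the cluster."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]       # n x n x d pairwise differences
    sq = (diff ** 2).sum(axis=-1)              # n x n squared distances
    return np.sqrt(sq.sum() / (n * (n - 1)))   # self-pairs contribute 0

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(centroid(X), radius(X), diameter(X))
```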
  3. Clustering Feature (CF)
     Given N d-dimensional data points {Xi}, i = 1, 2, …, N, in a cluster:
         CF = (N, LS, SS)
     where N is the number of data points in the cluster, LS is the linear sum of
     the N data points, and SS is the square sum of the N data points.

     CF Additivity Theorem
     If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two
     disjoint subclusters, then the CF entry of the subcluster formed by merging
     them is:
         CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

     CF Tree
     • B = maximum number of CF entries in a non-leaf node.
     • L = maximum number of CF entries in a leaf node.
     • T = threshold on the radius (or diameter) of the sub-clusters stored in
       the leaf nodes.
     [Figure: a CF tree. The root and each non-leaf node hold entries
     CF1 … CFb with child pointers child1 … childb; leaf nodes hold up to L CF
     entries and are chained together by prev/next pointers.]
     • The tree size is a function of T: the bigger T is, the smaller the tree
       will be.
     • The CF tree is built dynamically as the data is scanned.

     CF Tree Insertion
     • Identify the appropriate leaf: recursively descend the CF tree, choosing
       the closest child node according to a chosen distance metric.
     • Modify the leaf: test whether the leaf can absorb the new point without
       violating the threshold; if there is no room, split the node.
     • Modify the path: update the CF information up the path.

     BIRCH Clustering Algorithm
     • Phase 1: scan all data and build an initial in-memory CF tree.
     • Phase 2: condense the tree to a desirable size by building a smaller CF tree.
     • Phase 3: global clustering.
     • Phase 4: cluster refining – optional; requires more passes over the data
       to refine the results.
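     To make the CF triple and the additivity theorem concrete, here is a small
     sketch (not from the paper; SS is kept as a scalar sum of squared norms,
     one common convention). Note how the radius falls out of (N, LS, SS) alone,
     with no access to the raw points:

```python
import numpy as np

class CF:
    """Clustering Feature: CF = (N, LS, SS) for one sub-cluster."""

    def __init__(self, n, ls, ss):
        self.n = n      # number of points
        self.ls = ls    # linear sum of the points (a d-dimensional vector)
        self.ss = ss    # square sum of the points (scalar: sum of ||Xi||^2)

    @classmethod
    def from_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def __add__(self, other):
        # Additivity theorem: merging two disjoint sub-clusters is
        # entrywise addition of their CF entries.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2: intra-cluster tightness without raw points.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - c @ c, 0.0))

cf = CF.from_point([0.0, 0.0]) + CF.from_point([2.0, 0.0])
print(cf.n, cf.centroid(), cf.radius())   # -> 2 [1. 0.] 1.0
```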
  4. BIRCH – Phase 1
     • Start with an initial threshold and insert points into the tree.
     • If memory runs out, increase the threshold and rebuild a smaller tree by
       reinserting the entries of the old tree, followed by the remaining values.
     • A good initial threshold is important but hard to figure out.
     • Outlier removal – outliers can be dropped while the tree is rebuilt.

     BIRCH – Phase 2
     • Optional.
     • The global algorithm of phase 3 tends to perform well only within a
       certain input size range, so phase 2 condenses the tree to prepare it for
       phase 3.
     • Also removes outliers and groups crowded sub-clusters.

     BIRCH – Phase 3
     Problems remaining after phase 1:
     • The input order affects the results.
     • Splitting is triggered by node size rather than by natural cluster
       boundaries.
     Phase 3 therefore clusters all leaf entries, using their CF values, with an
     existing global algorithm; the algorithm used here is agglomerative
     hierarchical clustering.

     BIRCH – Phase 4
     • Optional.
     • Additional scan(s) of the dataset, attaching each item to the closest of
       the centroids found, then recalculating the centroids and redistributing
       the items.
     • Always converges.

     Clustering Example
     Pixel classification in images.
     [Figure: from top to bottom – visible wavelength band, near-infrared band,
     K-means clustering into 5 classes, BIRCH classification.]
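     For completeness, scikit-learn ships a BIRCH implementation; the sketch
     below is a usage illustration (not part of the talk) whose parameters map
     onto the slides' quantities: threshold plays the role of T,
     branching_factor caps the CF entries per node (the B and L above), and
     n_clusters drives the phase-3 global clustering step:

```python
import numpy as np
from sklearn.cluster import Birch

# Three synthetic Gaussian blobs stand in for a "very large" dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.2, size=(200, 2))
               for loc in ((0, 0), (3, 3), (0, 3))])

# threshold ~ T, branching_factor ~ B/L, n_clusters -> phase-3 clustering.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "CF sub-clusters reduced to",
      len(set(labels)), "final clusters")
```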
  5. Conclusions
     • BIRCH performs faster than the previously existing algorithms on large
       datasets.
     • It scans the whole dataset only once.
     • It handles outliers.

     Discussion
     After reading the two data mining papers, what do you think are the criteria
     for calling a data mining algorithm "good"? Efficiency? I/O cost?
     Memory/disk requirements? Stability? Immunity to abnormal data?

     Thanks for listening