3. Definition
• An effective data clustering method for very large databases.
• Each clustering decision is made without scanning all data points.
• An unsupervised data mining algorithm that performs hierarchical clustering.
4. Data Clustering
• Partitioning of a database into clusters.
• A cluster is a closely packed group.
• A collection of data objects that are similar to one another and are treated collectively as a group.
5. Clustering Feature (CF)
• A CF is a compact summary of the data points in a cluster: the triple (N, LS, SS) of the point count, the linear sum of the points, and the sum of their squared norms.
• It carries enough information to compute intra-cluster distances such as the centroid and radius.
• The additivity theorem allows sub-clusters to be merged by simply adding their CFs (see the sketch below).
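A minimal Python sketch of a CF and its additivity, assuming the standard (N, LS, SS) triple; the class and method names are illustrative, not from any reference implementation:

```python
import numpy as np

# A CF summarizes a cluster as (N, LS, SS): point count, linear sum of the
# points, and sum of squared norms.
class CF:
    def __init__(self, n=0, ls=None, ss=0.0, dim=2):
        self.n = n
        self.ls = np.zeros(dim) if ls is None else np.asarray(ls, float)
        self.ss = ss

    def add_point(self, x):
        x = np.asarray(x, float)
        self.n += 1
        self.ls = self.ls + x
        self.ss += float(x @ x)

    def merge(self, other):
        # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        return CF(self.n + other.n, self.ls + other.ls,
                  self.ss + other.ss, dim=len(self.ls))

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the points to the centroid, derived from
        # (N, LS, SS) alone: R^2 = SS/N - ||centroid||^2
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```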
6. BIRCH Goals
• Minimize running time and the number of data scans.
• Make each clustering decision without scanning the whole dataset.
• Exploit the non-uniformity of the data.
• Treat dense regions collectively as single sub-clusters and remove noise.
7. Algorithm
• Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
• Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
• Phase 3: Global clustering.
• Phase 4: Cluster refinement (optional): requires additional passes over the data to refine the results. (A sketch of the overall flow follows this list.)
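scikit-learn ships a Birch estimator covering the tree-building and global-clustering phases; a minimal sketch of the flow, with illustrative parameter values and synthetic data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))      # stand-in for a large database

birch = Birch(
    threshold=0.5,        # Phase 1: max radius of a leaf sub-cluster
    branching_factor=50,  # max CF entries per tree node
    n_clusters=3,         # Phase 3: globally cluster the leaf entries
)
labels = birch.fit_predict(X)         # builds the CF tree in one scan,
                                      # then clusters its leaf entries
```

Passing n_clusters=None skips the final global step and returns the raw leaf sub-clusters, i.e. the output of Phase 1 alone.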
8. BIRCH - Phase 1
• Starts with an initial threshold and inserts points into the CF tree.
• If it runs out of memory before finishing the scan, it increases the threshold and rebuilds a new, smaller CF tree by re-inserting the leaf entries of the old tree, then resumes the scan.
• A good initial threshold is important but hard to determine.
• Outliers can be removed while rebuilding the tree. (A toy insertion sketch follows.)
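A toy sketch of the Phase 1 insertion rule, reusing the CF class from slide 5; the real CF tree also has internal nodes, node splits, threshold rebuilds, and disk recycling, all omitted here, and THRESHOLD is an assumed value:

```python
import numpy as np

THRESHOLD = 0.5          # assumed initial threshold
leaf_entries = []        # flat list standing in for the tree's leaf level

def as_cf(x):
    cf = CF(dim=len(x))  # CF class from the slide 5 sketch
    cf.add_point(x)
    return cf

def insert(x):
    x = np.asarray(x, float)
    # find the closest leaf entry by centroid distance
    best = min(leaf_entries,
               key=lambda cf: np.linalg.norm(cf.centroid() - x),
               default=None)
    # absorb x only if the merged entry stays within the threshold;
    # otherwise the point starts a new leaf entry
    if best is not None and best.merge(as_cf(x)).radius() <= THRESHOLD:
        best.add_point(x)
    else:
        leaf_entries.append(as_cf(x))
```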
10. BIRCH - Phase 3
Problems after Phase 1:
• The input order affects the results.
• Splits are triggered by node size, not by natural cluster boundaries.
Phase 3:
• Cluster all leaf entries on their CF values using an existing clustering algorithm.
• The algorithm used here is agglomerative hierarchical clustering (see the sketch below).
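A sketch of the global step: agglomerative hierarchical clustering over the leaf-entry centroids from the Phase 1 sketch (the paper weights entries by their point counts; they are left unweighted here for brevity):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# centroids of the leaf entries produced by Phase 1
centroids = np.array([cf.centroid() for cf in leaf_entries])
global_labels = AgglomerativeClustering(n_clusters=3).fit_predict(centroids)
# every point summarized by leaf entry i inherits global_labels[i]
```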
11. BIRCH - Phase 4
• Optional.
• Additional scans of the dataset, assigning each item to the nearest centroid found in Phase 3.
• Recalculates the centroids and redistributes the items.
• Always converges. (A refinement-pass sketch follows this list.)
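A sketch of one refinement pass, assuming X and the Phase 3 centroids from the earlier sketches; repeated passes are Lloyd-style (k-means) iterations, which is why the refinement converges (empty clusters are not handled here):

```python
import numpy as np

def refine(X, centroids):
    # assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[assign == k].mean(axis=0)
                              for k in range(len(centroids))])
    return new_centroids, assign
```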
12. Conclusion
• BIRCH performs faster than existing algorithms (CLARANS and k-means) on large databases.
• Scans the whole dataset only once (the optional refinement phase adds passes).
• Handles outliers better.
• Superior to other algorithms in stability and scalability.