Birch

Birch Algorithm
Presented by:
Binod Malla
09BIM012

Introduction
•Balance
•Iterative
•Reducing &
•Clustering using
•Hierarchies

Definition
• Effective data clustering method for very large databases
• Each clustering decision is made without scanning all data points
• unsupervised data mining algorithm used to perform hierarchical
clustering

Data Clustering
• It is portioning of a database into clusters
• Closely packed group
• Collection of data objects that are similar to one another & treated
collectively in a group

Clustering Factor CF
• CF is compact storage for data on points in cluster
• Has enough information to calculate the intra-cluster distances
• Additive theorem allows to merge sub-clusters

BIRCH goal
• Minimizes running time and data scans
• Clustering decision made without scanning whole data
• Exploit the non uniformity of the data
• Treat dense areas as one and reduce noise

Algorithm
• Phase 1: Scan all data and build an initial in-memory CF tree, using
the given amount of memory and recycling space on disk.
• Phase 2: Condense into desirable length by building a smaller CF tree.
• Phase 3: Global clustering.
• Phase 4: Cluster refining – this is optional, and requires more passes
over the data to refine the results.

Birch- Phase 1
• Starts with initial threshold and inserts points into the tree.
• If it runs out of memory before it finishes scanning the data,
it increases the threshold value and rebuilds a new, smaller CF tree,
by re-inserting the leaf entries from the older tree and then other values
• Good initial threshold is important but hard to figure out.
• Outlier removal (when rebuilding tree).

Birch- Phase2
• Optional
• Preparation for phase 3
• Removes outliers and grouping clusters

Birch- Phase3
Problem after phase 1
• Input order affects results.
• Splitting triggered by node size.
Phase 3
• Cluster all leaf node on the CF values according to an existing
algorithm
• Algorithm used here: agglomerative hierarchical clustering

Birch-Phase4
• Optional
• Additional scans of the datasets attaching each items to the centroids
found.
• Recalculating the centroid and redistributing the items
• Always converges

Conclusion
• Birch performs faster than existing algorithms
(CLARANS and KMEANS) on large databases.
• Scans whole data only once.
• Handles outliers better.
• Superior to other algorithms in stability and scalability.

Birch

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (13)

Similar to Birch

Similar to Birch (20)

Birch