BIRCH Algorithm
Presented by:
Binod Malla
09BIM012
Introduction
• Balanced
• Iterative
• Reducing and
• Clustering using
• Hierarchies
Definition
• Effective data clustering method for very large databases
• Each clustering decision is made without scanning all data points
• Unsupervised data mining algorithm used to perform hierarchical clustering
Data Clustering
• It is the partitioning of a database into clusters
• A cluster is a closely packed group
• A collection of data objects that are similar to one another and are treated collectively as a group
Clustering Feature (CF)
• A CF is a compact summary of the data points in a cluster (see the sketch below)
• It carries enough information to compute intra-cluster distances
• The additivity theorem allows sub-clusters to be merged by adding their CFs
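The slides do not spell out what a CF stores; as a hedged sketch, the original BIRCH paper represents a Clustering Feature as the triple CF = (N, LS, SS): the point count, the linear sum of the points, and the sum of their squared norms, and the additivity theorem merges two CFs by component-wise addition. A minimal Python illustration (the class and names are my own, not from the slides):

# Minimal sketch of a Clustering Feature (CF) and its additivity.
# CF = (N, LS, SS): point count, linear sum, sum of squared norms.
import numpy as np

class CF:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n = 1                 # number of points summarized
        self.ls = p.copy()         # linear sum of the points
        self.ss = float(p @ p)     # sum of squared norms

    def merge(self, other):
        # Additivity theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - c @ c, 0.0))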
BIRCH Goals
• Minimize running time and data scans
• Make each clustering decision without scanning the whole data set
• Exploit the non-uniformity of the data
• Treat dense regions as single sub-clusters and remove noise
Algorithm
• Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
• Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
• Phase 3: Global clustering.
• Phase 4: Cluster refinement (optional); requires more passes over the data to refine the results. A usage sketch of these phases follows below.
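As a practical illustration (an assumption of mine, not part of the slides), scikit-learn's Birch estimator exposes roughly the same phases through its parameters: threshold and branching_factor govern the Phase 1 CF-tree construction, and n_clusters drives the Phase 3 global clustering. A minimal usage sketch:

# Sketch: BIRCH phases as exposed by scikit-learn's Birch estimator.
# Assumes scikit-learn is installed; parameter values are illustrative.
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10_000, 2)          # stand-in for a large database

model = Birch(
    threshold=0.05,        # Phase 1: max radius of a leaf sub-cluster
    branching_factor=50,   # Phase 1: max entries per CF-tree node
    n_clusters=3,          # Phase 3: global clustering of leaf entries
)
labels = model.fit_predict(X)          # a single scan builds the CF tree
print(len(model.subcluster_centers_))  # leaf sub-clusters from Phase 1
print(np.unique(labels))               # final cluster labels from Phase 3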
BIRCH - Phase 1
• Starts with an initial threshold and inserts points into the CF tree (a simplified insertion sketch follows below).
• If it runs out of memory before the scan finishes, it increases the threshold and rebuilds a new, smaller CF tree by re-inserting the leaf entries of the old tree, then continues with the remaining data.
• A good initial threshold is important but hard to choose.
• Outliers can be removed when the tree is rebuilt.
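A simplified, hypothetical sketch of the Phase 1 insertion rule (node splitting, the branching factor, and rebuilding on memory exhaustion are all omitted): a point is absorbed by the closest leaf entry only if the merged entry's radius stays within the threshold; otherwise a new entry is started. It reuses the CF class from the sketch above.

# Simplified Phase 1 insertion into a flat list of leaf entries.
import numpy as np  # the CF class is defined in the earlier sketch

def insert_point(leaf_entries, point, threshold):
    point = np.asarray(point, dtype=float)
    if leaf_entries:
        # Find the leaf entry whose centroid is closest to the point.
        best = min(leaf_entries,
                   key=lambda cf: np.linalg.norm(cf.centroid() - point))
        candidate = CF(point)
        candidate.merge(best)          # what the merged entry would be
        if candidate.radius() <= threshold:
            best.merge(CF(point))      # absorb the point into that entry
            return
    leaf_entries.append(CF(point))     # otherwise start a new leaf entry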
BIRCH - Phase 2
• Optional
• Preparation for Phase 3
• Removes outliers and groups crowded sub-clusters into larger ones
BIRCH - Phase 3
Problems after Phase 1:
• Input order affects the results.
• Splitting is triggered by node size rather than by the natural structure of the data.
Phase 3:
• Clusters all leaf entries on their CF values using an existing clustering algorithm (see the sketch below).
• Algorithm used here: agglomerative hierarchical clustering.
BIRCH - Phase 4
• Optional
• Additional passes over the data set, assigning each item to the nearest centroid found in Phase 3 (one refinement pass is sketched below).
• Recalculates the centroids and redistributes the items.
• Always converges.
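A hypothetical sketch of a single Phase 4 refinement pass: re-read the data, assign every item to the nearest centroid found in Phase 3, then recompute the centroids. Repeating this is the same fixed-point iteration as k-means, which is why the phase converges.

# Sketch of one Phase 4 refinement pass over the full data set.
import numpy as np

def refine_once(X, centroids):
    # Assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centroid from the points assigned to it.
    new_centroids = np.vstack([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids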
Conclusion
• BIRCH runs faster than existing algorithms (CLARANS and k-means) on large databases.
• Scans the whole data set only once (additional refinement passes are optional).
• Handles outliers better.
• Superior to other algorithms in stability and scalability.
