Birch: An efficient data clustering method for very large databases
  Tian Zhang, Raghu Ramakrishnan, Miron Livny
  CPSC 504
  Presenter: Joel Lanir
  Discussion: Dan Li

Outline
  What is data clustering
  Data clustering applications
  Previous Approaches
  Birch’s Goal
  Clustering Feature
  Birch clustering algorithm
  Clustering example




What is Data Clustering?
  A cluster is a closely-packed group.
  A collection of data objects that are similar to one another and treated collectively as a group.
  Data Clustering is the partitioning of a dataset into clusters

Data Clustering
  Helps understand the natural grouping or structure in a dataset
  Large set of multidimensional data
  Data space is usually not uniformly occupied
  Identify the sparse and crowded places
  Helps visualization




Discussion
  Can you give some examples of very large databases? What applications can you imagine that require such large databases for clustering?
  What special requirements do “large” databases impose on clustering, or more generally on data mining?

Some Clustering applications
  Biology – building groups of genes with related patterns
  Marketing – partitioning the population of consumers into market segments
  Division of WWW pages into genres
  Image segmentation – for object recognition
  Land use – identification of areas of similar land use from satellite images
  Insurance – identifying groups of policy holders with high average claim cost




Data Clustering – previous approaches
  Probability based (machine learning):
    Makes the often-wrong assumption that the distributions on different attributes are independent of each other
    Probability representations of clusters are expensive

Approaches
  Distance based (statistics):
    There must be a distance metric between two items
    Assumes that all data points are in memory and can be scanned frequently
    Ignores the fact that not all data points are equally important
    Close data points are not gathered together
    Inspects all data points on multiple iterations
  These approaches do not deal with dataset and memory size issues!




Clustering parameters
  Centroid – Euclidean center
  Radius – average distance from the member points to the centroid
  Diameter – average pairwise distance within a cluster
  Radius and diameter are measures of the tightness of a cluster around its center. We wish to keep these low.

Clustering parameters
  Other measurements (like the Euclidean distance between the centroids of two clusters) measure how far apart two clusters are.
  A good quality clustering will produce high intra-cluster similarity and low inter-cluster similarity
  A good quality clustering can help find hidden patterns
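The centroid, radius, and diameter described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the "average" is taken in the root-mean-square sense (as in the BIRCH paper's definitions), which is an assumption about what the slide means.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Euclidean center: the coordinate-wise mean of the points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[k] for p in points) / n for k in range(dim)]

def radius(points):
    """Root-mean-square distance from the points to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))

def diameter(points):
    """Root-mean-square pairwise distance within the cluster."""
    n = len(points)
    if n < 2:
        return 0.0
    # Ordered pairs (i, j); pairs with i == j contribute zero.
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0, 0), (2, 0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```

For the two-point cluster above, the centroid is (1, 0), the radius is 1, and the diameter is 2.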




Birch’s goals:
  Minimize running time and data scans, thus formulating the problem for large databases
  Clustering decisions made without scanning the whole data
  Exploit the non uniformity of data – treat dense areas as one, and remove outliers (noise)

Clustering Feature (CF)
  CF is a compact storage for data on points in a cluster
  Has enough information to calculate the intra-cluster distances
  Additivity theorem allows us to merge sub-clusters




Clustering Feature (CF)
  Given N d-dimensional data points in a cluster: {Xi} where i = 1, 2, …, N,
    CF = (N, LS, SS)
  N is the number of data points in the cluster,
  LS is the linear sum of the N data points,
  SS is the square sum of the N data points.

CF Additivity Theorem
  If CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF entries of two disjoint subclusters,
  the CF entry of the subcluster formed by merging the two disjoint subclusters is:
    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
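A minimal sketch of a CF entry and the additivity theorem follows. The class name `CF` and its methods are hypothetical names chosen for this example, and SS is stored as a single scalar (the sum of squared norms), which is one common convention rather than the only one. It also shows why the CF "has enough information": the centroid and the radius can be recovered from (N, LS, SS) alone.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CF:
    n: int        # N: number of points
    ls: tuple     # LS: coordinate-wise linear sum of the points
    ss: float     # SS: sum of squared norms of the points

    @staticmethod
    def from_point(p):
        """CF entry of a sub-cluster holding a single point."""
        return CF(1, tuple(float(x) for x in p), float(sum(x * x for x in p)))

    def __add__(self, other):
        """Additivity theorem: merging disjoint sub-clusters adds their CFs."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        """RMS distance to the centroid: sqrt(SS/N - |LS/N|^2)."""
        c2 = sum(x * x for x in self.centroid())
        # max(...) guards against tiny negative values from rounding.
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

merged = CF.from_point((0, 0)) + CF.from_point((2, 0))
print(merged, merged.centroid(), merged.radius())
```

Merging the single-point entries for (0, 0) and (2, 0) gives CF = (2, (2, 0), 4), with centroid (1, 0) and radius 1, without revisiting the original points.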




CF Tree
  [Figure: CF tree. The root and each non-leaf node hold entries CF1, CF2, CF3, …, CFb, each with a pointer child1, child2, child3, …, childb. Each leaf node holds entries CF1, CF2, …, CFL and is linked to its neighbouring leaf nodes by prev and next pointers.]
  B = Max. no. of CF in a non-leaf node
  L = Max. no. of CF in a leaf node
  T = Max. radius of a sub-cluster

CF TREE
  T is the threshold for the diameter or radius of the sub-clusters stored in the leaf nodes
  The tree size is a function of T. The bigger T is, the smaller the tree will be.
  The CF tree is built dynamically as data is scanned.




CF Tree Insertion
  Identifying the appropriate leaf: recursively descend the CF tree, choosing the closest child node according to a chosen distance metric
  Modifying the leaf: test whether the closest leaf entry can absorb the new point without violating the threshold. If it cannot, add a new entry to the leaf; if there is no room in the leaf, split the node
  Modifying the path: update CF information up the path.

Birch Clustering Algorithm
  Phase 1: Scan all data and build an initial in-memory CF tree.
  Phase 2: Condense into a desirable range by building a smaller CF tree.
  Phase 3: Global clustering
  Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results
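The "modifying the leaf" step can be sketched as follows. This is a deliberately flattened illustration, not the full CF tree: it keeps a single list of leaf entries, so there is no descent through non-leaf nodes, no node splitting, and no path update. All function names are hypothetical, CF entries are plain (N, LS, SS) tuples, and the threshold T is applied to the radius.

```python
import math

def cf_of(point):
    """CF entry (N, LS, SS) for a single point."""
    return (1, tuple(float(x) for x in point), float(sum(x * x for x in point)))

def cf_add(a, b):
    """Additivity theorem on (N, LS, SS) tuples."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def insert(entries, point, threshold):
    """Absorb the point into the closest entry if the merged radius stays
    within the threshold; otherwise start a new sub-cluster entry."""
    new = cf_of(point)
    if entries:
        idx = min(range(len(entries)),
                  key=lambda i: sq_dist(cf_centroid(entries[i]), new[1]))
        candidate = cf_add(entries[idx], new)
        if cf_radius(candidate) <= threshold:
            entries[idx] = candidate
            return
    entries.append(new)

def build(points, threshold):
    entries = []
    for p in points:
        insert(entries, p, threshold)
    return entries

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
for t in (0.1, 1.0, 10.0):
    print(t, len(build(points, t)))
```

On these four points the sketch produces 4 entries for T = 0.1, 2 entries for T = 1.0, and 1 entry for T = 10.0, which matches the remark that a bigger T gives a smaller tree.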




Birch – Phase 1
  Start with an initial threshold and insert points into the tree
  If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the entries of the old tree, then continue with the remaining values
  A good initial threshold is important but hard to figure out
  Outlier removal – outliers are removed when the tree is rebuilt

Birch – Phase 2
  Optional
  The Phase 3 algorithm sometimes has an input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
  Removes outliers and groups sub-clusters.
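The Phase 1 rebuild loop (raise the threshold, reinsert the old entries, carry on scanning) can be sketched on a flat list of CF entries. Several things here are assumptions made for illustration: `max_entries` stands in for the memory budget, `growth` is a hypothetical fixed multiplier for the threshold (the slides only say the threshold is increased), there is no tree structure, and outlier removal is left out.

```python
import math

def cf_add(a, b):
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def insert_cf(entries, new, threshold):
    """Merge a CF entry into the closest existing entry if the radius allows."""
    if entries:
        idx = min(range(len(entries)),
                  key=lambda i: sq_dist(cf_centroid(entries[i]), cf_centroid(new)))
        candidate = cf_add(entries[idx], new)
        if cf_radius(candidate) <= threshold:
            entries[idx] = candidate
            return
    entries.append(new)

def phase1(points, threshold, max_entries, growth=2.0):
    """Single scan; whenever the entry budget is exceeded, increase the
    threshold and rebuild from the old entries (not from the raw points)."""
    if threshold <= 0 or max_entries < 1 or growth <= 1:
        raise ValueError("need threshold > 0, max_entries >= 1, growth > 1")
    entries = []
    for p in points:
        cf = (1, tuple(float(x) for x in p), float(sum(x * x for x in p)))
        insert_cf(entries, cf, threshold)
        while len(entries) > max_entries:
            threshold *= growth
            old, entries = entries, []
            for e in old:
                insert_cf(entries, e, threshold)
    return entries, threshold

entries, final_t = phase1([(0, 0), (1, 0), (10, 0), (11, 0)],
                          threshold=0.1, max_entries=2)
print(len(entries), final_t)
```

Because the rebuild reinserts CF entries rather than raw points, the data itself is still read only once.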




Birch – Phase 3
  Problems after phase 1:
    Input order affects results
    Splitting triggered by node size
  Phase 3:
    Cluster all leaf entries by their CF values using an existing algorithm
    Algorithm used here: agglomerative hierarchical clustering

Birch – Phase 4
  Optional
  Additional scan(s) of the dataset, attaching each item to the closest of the centroids found.
  Recalculating the centroids and redistributing the items.
  Always converges
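Phases 3 and 4 can be sketched together. The example assumes the leaf entries are available as (N, LS, SS) tuples, uses centroid distance as the merge criterion for the agglomerative step (one possible choice, not necessarily the one used in the paper), and performs a single refinement pass. All function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def merge(cf1, cf2):
    """CF additivity on (N, LS, SS) tuples."""
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            cf1[2] + cf2[2])

def centroid(cf):
    return tuple(x / cf[0] for x in cf[1])

def global_clustering(leaf_cfs, k):
    """Phase 3 sketch: agglomerative merging of the two entries with the
    closest centroids until only k clusters remain."""
    clusters = list(leaf_cfs)
    while len(clusters) > k:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: sq_dist(centroid(clusters[ab[0]]),
                                                 centroid(clusters[ab[1]])))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [centroid(c) for c in clusters]

def refine(points, centroids):
    """Phase 4 sketch: one extra scan that attaches each point to its
    nearest centroid and then recomputes the centroids."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(p, centroids[c]))
        groups[nearest].append(p)
    new_centroids = []
    for old, members in zip(centroids, groups):
        if members:
            new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                       for d in range(len(old))))
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid
    return new_centroids, groups

points = [(0, 0), (1, 0), (10, 0), (11, 0)]
leaf_cfs = [(1, p, sum(x * x for x in p)) for p in points]
seeds = global_clustering(leaf_cfs, k=2)
print(refine(points, seeds))
```

Repeating `refine` with its own output as the new centroids corresponds to the "recalculating the centroids and redistributing the items" loop on the Phase 4 slide.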




Clustering example
  Pixel classification in images
  From top to bottom:
    BIRCH classification
    Visible wavelength band
    Near-infrared band

Clustering example
  [Figure: K-means clustering to 5 classes; axes labelled band1, band2, and band224]




Conclusions
  Birch performs faster than existing algorithms on large datasets
  Scans whole data only once
  Handles outliers

Discussion
  After reading the two papers on data mining, what do you think are the criteria for saying that a data mining algorithm is “good”?
    Efficiency?
    I/O cost?
    Memory/disk requirement?
    Stability?
    Immunity to abnormal data?




 Thanks for listening




