Canopy clustering algorithm

Canopy clustering is an unsupervised pre-clustering algorithm, typically run before K-means or
hierarchical clustering.

Its purpose is to speed up clustering on large data sets, where applying the main algorithm directly
may be impractical because of the size of the data set.

Algorithm

Start with a list of data points and two distance thresholds T1 and T2, with T1 > T2.

    1. Select any point (at random) from this list to form a canopy center.

    2. Compute its distance to all other points in the list, using a cheap, approximate distance measure.

    3. Put all points that fall within distance T1 of the center into a canopy.

    4. Remove from the original list all points that fall within distance T2 of the center. These
        points can no longer become the center of a new canopy.

    5. Repeat steps 1 to 4 until the original list is empty.
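
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names canopy_clustering, euclidean and rng are hypothetical, and plain Euclidean distance stands in for whatever cheap, approximate distance measure suits the data.

```python
import math
import random


def euclidean(a, b):
    # Stand-in distance; in practice a cheaper approximate measure would be used.
    return math.dist(a, b)


def canopy_clustering(points, t1, t2, distance=euclidean, rng=None):
    """Return a list of (center_index, member_indices) canopies.

    Assumes t1 > t2 > 0. A point may belong to more than one canopy.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = rng or random.Random()

    remaining = list(range(len(points)))  # indices still in the list
    canopies = []
    while remaining:  # step 5: repeat until the list is empty
        # Step 1: pick a random remaining point as the canopy center.
        center = rng.choice(remaining)
        # Step 2: distance from the center to every point still in the list.
        dists = {i: distance(points[center], points[i]) for i in remaining}
        # Step 3: everything within T1 joins this canopy.
        members = [i for i in remaining if dists[i] <= t1]
        canopies.append((center, members))
        # Step 4: everything within T2 (including the center) leaves the list.
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies
```

Calling canopy_clustering(points, t1=3.0, t2=1.0) on a list of coordinate tuples returns one entry per canopy. Points lying between T2 and T1 of a center stay in the list, so they can appear in several overlapping canopies; the more expensive main clustering step then only needs to compare points that share a canopy.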

For a detailed treatment, see the paper by McCallum, Nigam and Ungar, available at
http://www.kamalnigam.com/papers/canopy-kdd00.pdf.

References:

   i.   Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional
        Data Sets with Application to Reference Matching
  ii.   https://cwiki.apache.org/MAHOUT/canopy-clustering.html
 iii.   http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
