1. Canopy clustering algorithm
It is an unsupervised pre-clustering algorithm performed before a K-means clustering or Hierarchical
clustering.
It is basically performed to speed up the clustering in the case of large data sets, in which a direct
implementation of the main algorithm may be impractical due to the size of the data set.
Algorithm
Start with a set/list of data points and two distance T1 > T2 for processing.
1. Select any point (at random) from this list to form a canopy center.
2. Approximate its distance to all other points in the list.
3. Put all the points which fall within the distance threshold of T1 into a canopy.
4. Remove from the (main/original) list all the points which fall within the threshold of T2. These
points are excluded from being the center of and forming new canopies.
5. Repeat from step 1 to 4 until the original list is empty.
For an exhaustive study please go through a paper by McCallum, Nigam and Ungar, located at
http://www.kamalnigam.com/papers/canopy-kdd00.pdf.
References:
i. Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional
Data Sets with Application to Reference Matching
ii. https://cwiki.apache.org/MAHOUT/canopy-clustering.html
iii. http://en.wikipedia.org/wiki/Canopy_clustering_algorithm