Canopy clustering algorithm

Canopy clustering is an unsupervised pre-clustering algorithm, typically run before K-means or
hierarchical clustering.

Its purpose is to speed up clustering of large data sets, where running the main algorithm directly
may be impractical because of the size of the data set.

Algorithm

Start with a list of data points and two distance thresholds T1 > T2, where T1 is the loose threshold
and T2 is the tight one.

1. Select any point (at random) from this list to form a canopy center.

2. Compute its approximate distance to all other points in the list, using a cheap distance measure.

3. Put all points that fall within distance T1 of the center into a canopy.

4. Remove from the original list all points that fall within distance T2 of the center. These
points can no longer be chosen as the center of a new canopy.

5. Repeat steps 1 to 4 until the original list is empty.
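
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name canopy_clustering, its parameters, and the use of Euclidean distance (math.dist) in place of a cheap approximate measure are assumptions made for the example. It also assumes that "within" a threshold includes points exactly at that distance, and that distances are measured only to points still in the list.

```python
import math
import random


def canopy_clustering(points, t1, t2, distance=math.dist, seed=None):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight one (t1 > t2).
    Returns a list of (center, members) pairs.
    Hypothetical helper for illustration; names are not from the source.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:  # Step 5: repeat until the list is empty.
        # Step 1: pick a random point from the list as the canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        # Step 2: measure the distance from the center to every point
        # still in the list (exact Euclidean distance by default here).
        distances = [(p, distance(center, p)) for p in remaining]
        # Step 3: points within T1 join this canopy, along with the center.
        members = [center] + [p for p, d in distances if d <= t1]
        canopies.append((center, members))
        # Step 4: points within T2 leave the list, so they can never
        # become the center of a later canopy.
        remaining = [p for p, d in distances if d > t2]
    return canopies


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
    for center, members in canopy_clustering(sample, t1=2.0, t2=1.0, seed=0):
        print(center, members)
```

A point that lies between T2 and T1 from a center joins that canopy but stays in the list, so it may later become a center itself or join other canopies. This is why canopies can overlap.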

For a thorough treatment, see the paper by McCallum, Nigam and Ungar, available at
http://www.kamalnigam.com/papers/canopy-kdd00.pdf.

References:

i. Andrew McCallum, Kamal Nigam and Lyle H. Ungar, Efficient Clustering of High-Dimensional
Data Sets with Application to Reference Matching
ii. https://cwiki.apache.org/MAHOUT/canopy-clustering.html
iii. http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
