Recent algorithmic developments have enabled dramatic performance improvements for clustering applications. Previously, the workhorse clustering algorithm was k-means, whose cost scales as the product of the desired number of clusters, the data size, and the number of iterations required; the number of iterations itself depends on the number of clusters. In map-reduce implementations such as the one in Mahout, this inherently iterative structure makes the implementation exceedingly painful.
These new algorithms require only a single pass over the data, with a per-point cost of roughly O(log k), where k is the desired number of clusters. The resulting implementation, which is being ported into Mahout, has demonstrated some stunning speed: in one test, a single-threaded implementation clustered data at just 20 microseconds per data point. Moreover, the algorithm ports easily to map-reduce with essentially perfect linear scaling. This implies we should be able to cluster hundreds of millions of data points in minutes on a moderately sized cluster. Even more exciting, these are online algorithms, so it is possible to build a real-time clustering engine that clusters data points as they arrive and never needs to look back at old data.
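The abstract does not spell out the algorithm itself, but the general flavor of single-pass online clustering can be sketched as follows: each arriving point either folds into its nearest existing centroid or, with probability growing with its distance, seeds a new centroid; when the centroid budget is exceeded, the distance threshold is raised and nearby centroids are collapsed together. This is a minimal illustrative sketch of that idea, not the actual Mahout implementation; all names and parameters here (`beta`, `max_centroids`, the threshold schedule) are assumptions for illustration.

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def streaming_cluster(points, beta=1.5, max_centroids=100):
    """One-pass online clustering sketch (illustrative, not Mahout's code).

    Each point either joins its nearest centroid or, with probability
    proportional to its squared distance, seeds a new centroid. When the
    number of centroids exceeds the budget, the distance threshold is
    raised and nearby centroids are merged.
    """
    centroids = []        # each entry is [coordinates, weight]
    threshold = 1e-6      # distance scale; grows as data accumulates
    for p in points:
        if not centroids:
            centroids.append([list(p), 1])
            continue
        # find the nearest existing centroid by linear scan
        c, d2 = min(((c, dist2(p, c[0])) for c in centroids),
                    key=lambda t: t[1])
        if random.random() < min(1.0, d2 / threshold):
            # far-away points tend to start new clusters
            centroids.append([list(p), 1])
        else:
            # fold the point into the nearest centroid (weighted running mean)
            w = c[1]
            c[0] = [(x * w + y) / (w + 1) for x, y in zip(c[0], p)]
            c[1] = w + 1
        if len(centroids) > max_centroids:
            # over budget: raise the threshold and collapse nearby centroids
            threshold *= beta
            old, centroids = centroids, []
            for oc in old:
                for nc in centroids:
                    if dist2(oc[0], nc[0]) < threshold:
                        w = nc[1] + oc[1]
                        nc[0] = [(a * nc[1] + b * oc[1]) / w
                                 for a, b in zip(nc[0], oc[0])]
                        nc[1] = w
                        break
                else:
                    centroids.append(oc)
    return centroids
```

Note that this sketch finds the nearest centroid by a linear scan, so its per-point cost is O(k); the O(log k) figure quoted above presumably depends on an approximate nearest-neighbor structure over the centroids, which is omitted here for brevity.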
I will talk about the basic intuitions behind these algorithms, how they are implemented, their limitations, and how to use them. I will also discuss some of the very exciting practical implications of having a super-fast clustering algorithm.