Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive
for large disk-resident datasets. In view of this, a lot of work
has been done on various approximate versions of k-means,
which require only one or a small number of passes on the
entire dataset.
In this paper, we present a new algorithm, called Fast and
Exact K-means Clustering (FEKM), which typically requires
only one or a small number of passes on the entire dataset,
and provably produces the same cluster centers as reported
by the original k-means algorithm. The algorithm uses sampling
to create initial cluster centers, and then takes one or
more passes over the entire dataset to adjust these cluster
centers. We provide theoretical analysis to show that the cluster
centers thus reported are the same as the ones computed
by the original k-means algorithm. Experimental results from
a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means.
This paper also describes and evaluates a distributed version
of FEKM, which we refer to as DFEKM. This algorithm
is suitable for analyzing data that is distributed across loosely
coupled machines. Unlike the previous work in this area,
DFEKM provably produces the same results as the original
k-means algorithm. Our experimental results show that
DFEKM is clearly better than the two other options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
A World Immersed in Data!
Huge amounts of data.
Distributed data.
Streaming data.
Examples:
Business data
Wal-Mart (20M transactions per day)
AT&T (300M calls per day)
Mobil oil exploration data (100 TB)
Satellite and sensor data
NASA EOS project: 50 GB per hour
Sloan Digital Sky Survey (SDSS): 300M celestial objects; DR1 contains 1 TB of catalogs
Scientific simulation
BioSimGrid (10-40 TB of trajectory data)
Biology data
GenBank (>30 billion base pairs, >30 million sequences, 2003)
Web logs: Amazon, eBay, MSN
Overview of My Research
Out-of-Core Datasets
Implementation and Evaluation of Fast and
Exact KMeans Algorithm [ICDM’04]
Distributed Datasets
Distributed Fast and Exact KMeans Algorithm
[Under Submission]
Streaming Data
Two exact algorithms and two approximate
algorithms for clustering evolving streaming
data.
[Under Preparation]
Overview of the Talk
K Center Clustering Definition
KMeans Algorithm
Out of Core Data: FEKM and its
Evaluation
Distributed Data: DFEKM
Streaming Data: Four Algorithms.
K Center Clustering Problem
Input: a set of data points.
Output: k centers such that the sum of squared distances from each point to its closest center is minimized (stated formally below).
This problem is known to be NP-hard.
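In symbols (a standard statement of this objective, added here for reference): given points \(x_1, \dots, x_n\), find centers \(c_1, \dots, c_k\) minimizing

\[
\sum_{i=1}^{n} \; \min_{1 \le j \le k} \, \lVert x_i - c_j \rVert^2 .
\]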
K-means Algorithm
Developed by MacQueen in 1967.
Proven to converge to a local minimum (Bottou and Bengio, 1995).
Proven to be a special case of Newton's steepest descent (Cheng, 1995).
(A minimal reference implementation follows below.)
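For reference, a minimal NumPy sketch of the Lloyd-style KMeans iteration that the rest of the talk builds on (random initialization; an illustration, not the implementation evaluated here):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels
```

For example, `centers, labels = kmeans(np.random.rand(10000, 5), k=20)` clusters 10,000 random 5-dimensional points.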
Variants of Fast K-means Algorithms
• Pelleg and Moore: use of kd-trees [in memory]
• Bradley and Fayyad: iterative sampling and compression [out of core, approximate]
• Farnstrom et al.: simplified the previous approach
• Domingos and Hulten: use the Hoeffding inequality for successive sampling [sampling]
Roadmap of Presentation
Clustering algorithms:
- Out-of-core data: FEKM and its evaluation.
- Distributed data.
- Streaming data.
K-means Clustering on Samples: Confidence Radius
An estimate of an upper bound on the distance between a sample center and the corresponding k-means center (an illustrative bound is sketched below).
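The slides do not give the formula. Purely to illustrate the flavor of such a bound (an assumption here, not the paper's exact expression), a Hoeffding-style radius for a size-\(n\) sample drawn from data whose values span a range of width \(R\), holding with probability at least \(1 - \delta\), has the shape

\[
r \;=\; R \sqrt{\frac{\ln(2/\delta)}{2n}} ,
\]

i.e., it shrinks as the sample grows and widens as the required confidence increases.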
Basic Ideas of FEKM
Run KMeans on a sample and build a CA table storing the centers at each iteration.
Estimate the confidence radius from the sample.
In one scan, classify the data points as stable or boundary points for each row of the table.
Store sufficient statistics of the stable (core) points and retain the boundary points.
After the scan, recompute the centers at each row with this new information.
Verify whether the recomputed centers lie within the confidence radius.
If they are within the confidence radius, the computed centers are exactly those of the KMeans algorithm.
If they are not, use them as new initial centers and repeat the whole process (see the sketch below).
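A minimal sketch of the one-scan classification step, under stated assumptions: `ca_table`, `is_stable`, and `one_scan` are illustrative names; each CA-table row holds the sampled centers and their confidence radii for one KMeans iteration; and the stable/boundary test is a simplified margin condition, not necessarily the paper's exact criterion:

```python
import numpy as np

def is_stable(x, centers, radii):
    """A point is stable for a row if its nearest-center assignment cannot
    change even when every center drifts by its confidence radius."""
    d = np.linalg.norm(centers - x, axis=1)
    j = d.argmin()
    if len(d) == 1:
        return True, j
    worst_nearest = d[j] + radii[j]               # nearest center drifts away
    best_other = np.min(np.delete(d - radii, j))  # another center drifts closer
    return worst_nearest <= best_other, j

def one_scan(X, ca_table):
    """Classify every point against every CA-table row; keep sufficient
    statistics (count, linear sum) of stable points, retain boundary points."""
    results = []
    for centers, radii in ca_table:
        k, dim = centers.shape
        counts, sums, boundary = np.zeros(k), np.zeros((k, dim)), []
        for x in X:
            stable, j = is_stable(x, centers, radii)
            if stable:
                counts[j] += 1
                sums[j] += x
            else:
                boundary.append(x)
        results.append((counts, sums, np.array(boundary)))
    return results
```

Because stable points are reduced to counts and sums, only the (typically few) boundary points need to be kept per row.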
Experimental Setup and Datasets
Machines:
700 MHz Pentium processors with 1 GB of memory.
Synthetic datasets:
Similar to the ones used by Bradley et al.
Eighteen 1.1 GB datasets and two 4.4 GB datasets.
5, 10, 20, 50, 100, and 200 dimensions.
5, 10, and 20 clusters.
Real datasets (UCI ML archive):
KDDCup99 (38 dimensions, 1.8 GB, k=5)
Corel image (32 dimensions, 1.9 GB, k=16)
Reuters text database (258 dimensions, 2 GB, k=25)
Super-sampled and normalized to [0,1].
Performance of k-means and FEKM Algorithms on Synthetic Datasets
(running time in seconds; 20 clusters)

Size    Dim   Iterations   k-means time   FEKM time   Sample (%)   Passes
1.1GB   200   100          54862.33       27388.85    10           2
1.1GB   200   3            1898.65        584.88      5            1
1.1GB   100   100          41029.15       18106.51    10           2
1.1GB   100   3            1233.12        585.63      5            1
1.1GB   50    3            1796.30        882.36      5            1
1.1GB   20    10           5335.15        2112.42     5            2
1.1GB   10    6            3919.08        1643.75     5            1
1.1GB   5     6            4619.95        2353.41     5            1
4.4GB   100   2            4393.02        2931.53     10           1
4.4GB   100   10           21985.62       8194.07     10           1
4.4GB   100   10           21985.62       7467.53     5            1
Performance of k-means and FEKM Algorithms on Real Datasets
(running time in seconds; squared error is between the final centers and the centers after sampling)

Data      Iterations   k-means time   FEKM time   Sample (%)   Passes   Squared Error
KDD99     19           7151           2317        10           2        4.0
KDD99     19           7151           2529        15           2        3.5
KDD99     19           7151           2136        5            2        4.2
Corel     43           28442          10503       10           3        2.2
Corel     43           28442          12603       15           3        2.15
Corel     43           28442          9342        5            3        3.24
Reuters   20           41290          10311       10           2        10.1
Reuters   20           41290          11204       15           2        8.6
Reuters   20           41290          9214        5            2        14.9
Summary of Results
Typically requires only one or a small number of passes over the entire dataset.
Provably produces the same cluster centers as reported by the original k-means algorithm.
Experimental results from a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means.
Distributed Clustering
Growing number of distributed data repositories.
Downloading or merging data from remote sources is not efficient.
Insufficient work on distributed data mining algorithms.
Existing algorithms are approximate.
Distributed Fast and Exact KMeans (DFEKM)
Client nodes send samples to the central node.
The central node runs KMeans on the sample, builds the CA table, and sends it back to all the nodes.
Client nodes scan their datasets once, keep sufficient statistics of the stable points, keep the boundary points, and send their CA tables to the central node.
The central node merges all the CA tables and recomputes the exact centers.
The central node verifies whether the new centers are within the confidence radius.
If not, it sends new initial centers to all nodes and the whole process repeats (a sketch of the merge-and-verify step follows below).
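A sketch of the coordinator's merge-and-verify step for one CA-table row, under stated assumptions: each client reports the per-cluster (count, linear sum) statistics from its scan, `boundary_pts` pools the boundary points from all clients, and assigning each boundary point to its nearest sampled center is a simplification of the paper's per-row recomputation:

```python
import numpy as np

def merge_and_verify(client_stats, boundary_pts, centers, radii):
    """Merge additive sufficient statistics from all clients, fold in the
    retained boundary points, recompute centers, and test the radius."""
    counts = sum(c for c, _ in client_stats).astype(float)  # per-cluster counts
    sums = sum(s for _, s in client_stats).copy()           # per-cluster linear sums
    for x in boundary_pts:                 # assign retained points explicitly
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        sums[j] += x
    new_centers = sums / np.maximum(counts, 1)[:, None]
    # Accept only if every recomputed center stayed inside its confidence
    # radius; otherwise DFEKM restarts with new_centers as initial centers.
    ok = np.all(np.linalg.norm(new_centers - centers, axis=1) <= radii)
    return new_centers, ok
```

Because the statistics are additive, each client ships only k counts and k vectors per row, which is what keeps the protocol cheap on loosely coupled machines.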
Experiment Design
Performance when the nodes are loosely coupled.
Performance in the presence of imbalanced data.
Comparison with parallel KMeans.
Performance with varying numbers of nodes for the above scenarios.
Varying numbers of clusters and data dimensions.
Synthetic and real datasets.
Parallel KMeans
The central node sends initial centers to all nodes.
At each pass, each node scans its data and collects sufficient statistics (the count, linear sum, and squared sum of its data points).
The central node gathers these statistics, merges them across nodes, recomputes the centers, and sends the new centers to all nodes.
Repeat until the centers no longer move (see the sketch below).
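One pass of this scheme in miniature, with the MPI communication replaced by plain function calls (`local_stats` and `parallel_pass` are illustrative names); the additivity of the statistics is what makes the parallel version exact:

```python
import numpy as np

def local_stats(X_local, centers):
    """Per node: one scan, accumulating count, linear sum, and squared sum."""
    k, dim = centers.shape
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
    n, s, q = np.zeros(k), np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        pts = X_local[labels == j]
        n[j], s[j], q[j] = len(pts), pts.sum(axis=0), (pts ** 2).sum()
    return n, s, q

def parallel_pass(partitions, centers):
    """Central node: gather statistics, merge them, recompute the centers."""
    stats = [local_stats(Xp, centers) for Xp in partitions]  # stands in for an MPI gather
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    # Empty clusters keep their old centers.
    return np.where(n[:, None] > 0, s / np.maximum(n, 1)[:, None], centers)
```

Each pass costs one scan of the largest partition plus a small reduction, which is why both communication delay and data imbalance hurt.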
Parallel KMeans: Pros and Cons
Efficient: scales linearly with the number of nodes.
Performance deteriorates for loosely coupled machines.
Problems also arise with imbalanced data.
Related Research
An implementation of parallel KMeans using MPI by Dhillon and Modha.
It does not consider loosely coupled environments or imbalance in the data.
A new algorithm is required.
Summary of Results
Faster than parallel KMeans when the communication delay is high.
Faster than parallel KMeans when the data is highly imbalanced (50%, 20%, 20%, and 10% across four nodes).
Faster than sequential KMeans on a single node.
Parallel KMeans is faster for 4 nodes when the delay is less than 100s.
Related Work
Motwani et al.: a one-pass approximate algorithm (STREAM) for the k-median clustering problem.
Motwani et al.: an algorithm for maintaining clusters over a sliding window.
J. Han et al.: a framework for dealing with evolving streaming data.
Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK)
Run KMeans on the data in the first window.
Estimate the confidence radius, the cluster radius, and a threshold number of points for each cluster.
In each new window, find the stable, boundary, and outlier points.
Keep sufficient statistics of the stable points.
Keep the boundary and outlier points.
If the number of outliers exceeds the threshold, Case 1 applies: the cluster-center movement is substantial. Otherwise, Case 2 applies: the movement is within the confidence radius.
EADSK:
Case 1: runs KMeans on the complete dataset.
Case 2: runs a weighted KMeans on the saved boundary and outlier points together with the sufficient statistics of the stable points.
AADSK:
Case 1: runs weighted KMeans on the outlier and boundary points.
Case 2: runs weighted KMeans on the stable and boundary points.
(A control-flow sketch of this dispatch follows below.)
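A control-flow sketch of the dispatch above, under stated assumptions: `classify_window` is a hypothetical helper that splits a window into stable-point summaries (per-cluster mean and count), boundary points, and outliers; `kmeans` is the earlier sketch; and the weighted variant below treats a summary as a point weighted by its count:

```python
import numpy as np

def weighted_kmeans(weighted_pts, k, iters=50, seed=0):
    """Lloyd iterations over (point, weight) pairs; a stable-cluster summary
    enters as (mean, count), a raw point as (point, 1.0)."""
    P = np.array([p for p, _ in weighted_pts], dtype=float)
    w = np.array([wt for _, wt in weighted_pts], dtype=float)
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(iters):
        lab = ((P[:, None, :] - C[None, :, :]) ** 2).sum(2).argmin(1)
        for j in range(k):
            m = lab == j
            if w[m].sum() > 0:
                C[j] = (w[m][:, None] * P[m]).sum(0) / w[m].sum()
    return C

def process_window(window, state, exact):
    """One window of EADSK (exact=True) or AADSK (exact=False)."""
    summaries, boundary, outliers = classify_window(window, state)  # assumed helper
    if len(outliers) > state.outlier_threshold:
        # Case 1: cluster-center movement is substantial.
        if exact:
            state.centers = kmeans(window, state.k)[0]   # EADSK: full rerun
        else:                                            # AADSK: boundary + outliers
            state.centers = weighted_kmeans(
                [(p, 1.0) for p in boundary + outliers], state.k)
    else:
        # Case 2: movement stays within the confidence radius.
        pts = list(summaries) + [(p, 1.0) for p in boundary]
        if exact:
            pts += [(p, 1.0) for p in outliers]          # EADSK keeps outliers too
        state.centers = weighted_kmeans(pts, state.k)
    return state
```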
Exact and Approximate Stream KMeans (FESK and FASK)
These two algorithms do not use outliers.
In the initial window, estimate the confidence radius.
In each subsequent window:
Find the stable and boundary points.
Detect whether the centers move outside the confidence region.
If they do, FESK runs KMeans on the complete dataset, while FASK runs the weighted KMeans algorithm on the stable and boundary points.
If they do not, both FESK and FASK have computed the exact centers, the same as KMeans (see the sketch below).
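The corresponding per-window flow, reusing the `kmeans` and `weighted_kmeans` sketches above; `classify_window_no_outliers` is a hypothetical helper that returns only stable-point summaries and boundary points:

```python
import numpy as np

def fesk_fask_window(window, state, approximate):
    """One window of FESK (approximate=False) or FASK (approximate=True)."""
    summaries, boundary = classify_window_no_outliers(window, state)  # assumed helper
    # Candidate centers from the stable summaries plus boundary points.
    pts = list(summaries) + [(p, 1.0) for p in boundary]
    candidate = weighted_kmeans(pts, state.k)
    moved_out = np.any(
        np.linalg.norm(candidate - state.centers, axis=1) > state.conf_radius)
    if not moved_out:
        state.centers = candidate   # provably the same centers as KMeans
    elif approximate:
        state.centers = candidate   # FASK accepts the weighted approximation
    else:
        state.centers = kmeans(window, state.k)[0]   # FESK: full rerun
    return state
```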
[Figure: comparison of KMeans, FESK, and FASK on the C5d20 dataset, with a small number of points per window and the centers moving within the confidence region.]
Characteristics of the Algorithms
All four algorithms perform better than multi-pass KMeans when the centers do not move outside the confidence region.
FESK and EADSK always find the exact centers, the same as KMeans.
FESK and EADSK perform no better than KMeans if the centers move outside the confidence region in any window.
FASK and AADSK are much faster than KMeans in that case.
AADSK computes better results than FASK for evolving streaming data.
Conclusion
Efficient variants of the KMeans algorithm for out-of-core, distributed, and streaming data.
The evolving nature of streaming data is considered.
Streaming-data clustering is integrated with change detection.