Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive
for large disk-resident datasets. In view of this, a lot of work
has been done on various approximate versions of k-means,
which require only one or a small number of passes on the
entire dataset.
In this paper, we present a new algorithm, called Fast and
Exact K-means Clustering (FEKM), which typically requires
only one or a small number of passes on the entire dataset,
and provably produces the same cluster centers as reported
by the original k-means algorithm. The algorithm uses sampling
to create initial cluster centers, and then takes one or
more passes over the entire dataset to adjust these cluster
centers. We provide theoretical analysis to show that the cluster
centers thus reported are the same as the ones computed
by the original k-means algorithm. Experimental results from
a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means.
This paper also describes and evaluates a distributed version
of FEKM, which we refer to as DFEKM. This algorithm
is suitable for analyzing data that is distributed across loosely
coupled machines. Unlike the previous work in this area,
DFEKM provably produces the same results as the original
k-means algorithm. Our experimental results show that
DFEKM is clearly better than the two other options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
A World Immersed in Data!
Huge amounts of data.
Distributed data.
Streaming data.
Examples:
Business data
Wal-Mart (20M transactions per day)
AT&T (300M calls per day)
Mobil oil exploration data (100 TB)
Satellite and sensor data
NASA EOS project: 50 GB per hour
Sloan Digital Sky Survey (SDSS): 300M celestial objects; DR1 contains 1 TB of catalogs
Scientific simulation
BioSimGrid (10-40 TB of trajectory data)
Biology data
GenBank (>30 billion base pairs, >30 million sequences, 2003)
Web logs: Amazon, eBay, MSN
Overview of My Research
Out-of-Core Datasets
Implementation and Evaluation of Fast and
Exact KMeans Algorithm [ICDM’04]
Distributed Datasets
Distributed Fast and Exact KMeans Algorithm
[Under Submission]
Streaming Data
Two exact algorithms and two approximate
algorithms for clustering evolving streaming
data.
[Under Preparation]
Overview of the Talk
K Center Clustering Definition
KMeans Algorithm
Out of Core Data: FEKM and its
Evaluation
Distributed Data: DFEKM
Streaming Data: Four Algorithms.
K Center Clustering Problem
Input: a set of data points.
Output: k centers such that the sum of squared distances from each point to its closest center is minimized (stated formally below).
This problem is known to be NP-hard.
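In symbols (a standard statement of this objective, added here for reference): given points \(x_1, \dots, x_n\), find centers \(c_1, \dots, c_k\) minimizing

\[
\sum_{i=1}^{n} \; \min_{1 \le j \le k} \, \lVert x_i - c_j \rVert^2 .
\]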
K-means Algorithm
Developed by MacQueen in 1967.
Proven to converge to a local minimum (Bottou and Bengio, 1995).
Proven to be a special case of Newton's steepest descent (Cheng, 1995).
(A minimal reference implementation follows below.)
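For reference, a minimal NumPy sketch of the Lloyd-style KMeans iteration that the rest of the talk builds on (random initialization; an illustration, not the implementation evaluated here):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels
```

For example, `centers, labels = kmeans(np.random.rand(10000, 5), k=20)` clusters 10,000 random 5-dimensional points.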
Variants of Fast K-means Algorithms
• Pelleg and Moore: use of kd-trees [in memory]
• Bradley and Fayyad: iterative sampling and compression [out of core, approximate]
• Farnstrom et al.: simplified the previous approach
• Domingos and Hulten: use the Hoeffding inequality for successive sampling [sampling]
Roadmap of Presentation
Clustering algorithms:
- Out-of-core data: FEKM and its evaluation.
- Distributed data.
- Streaming data.
K-means Clustering on Samples: Confidence Radius
An estimate of an upper bound on the distance between a sample center and the corresponding k-means center (an illustrative bound is sketched below).
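The slides do not give the formula. Purely to illustrate the flavor of such a bound (an assumption here, not the paper's exact expression), a Hoeffding-style radius for a size-\(n\) sample drawn from data whose values span a range of width \(R\), holding with probability at least \(1 - \delta\), has the shape

\[
r \;=\; R \sqrt{\frac{\ln(2/\delta)}{2n}} ,
\]

i.e., it shrinks as the sample grows and widens as the required confidence increases.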
Basic Ideas of FEKM
Run KMeans on a sample and build a CA table storing the centers at each iteration.
Estimate the confidence radius from the sample.
In one scan, classify the data points as stable or boundary points for each row of the table.
Store sufficient statistics of the stable (core) points and retain the boundary points.
After the scan, recompute the centers at each row with this new information.
Verify whether the recomputed centers lie within the confidence radius.
If they are within the confidence radius, the computed centers are exactly those of the KMeans algorithm.
If they are not, use them as new initial centers and repeat the whole process (see the sketch below).
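A minimal sketch of the one-scan classification step, under stated assumptions: `ca_table`, `is_stable`, and `one_scan` are illustrative names; each CA-table row holds the sampled centers and their confidence radii for one KMeans iteration; and the stable/boundary test is a simplified margin condition, not necessarily the paper's exact criterion:

```python
import numpy as np

def is_stable(x, centers, radii):
    """A point is stable for a row if its nearest-center assignment cannot
    change even when every center drifts by its confidence radius."""
    d = np.linalg.norm(centers - x, axis=1)
    j = d.argmin()
    if len(d) == 1:
        return True, j
    worst_nearest = d[j] + radii[j]               # nearest center drifts away
    best_other = np.min(np.delete(d - radii, j))  # another center drifts closer
    return worst_nearest <= best_other, j

def one_scan(X, ca_table):
    """Classify every point against every CA-table row; keep sufficient
    statistics (count, linear sum) of stable points, retain boundary points."""
    results = []
    for centers, radii in ca_table:
        k, dim = centers.shape
        counts, sums, boundary = np.zeros(k), np.zeros((k, dim)), []
        for x in X:
            stable, j = is_stable(x, centers, radii)
            if stable:
                counts[j] += 1
                sums[j] += x
            else:
                boundary.append(x)
        results.append((counts, sums, np.array(boundary)))
    return results
```

Because stable points are reduced to counts and sums, only the (typically few) boundary points need to be kept per row.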
Experimental Setup and Datasets
Machines:
700 MHz Pentium processors with 1 GB of memory.
Synthetic datasets:
Similar to the ones used by Bradley et al.
Eighteen 1.1 GB datasets and two 4.4 GB datasets.
5, 10, 20, 50, 100, and 200 dimensions.
5, 10, and 20 clusters.
Real datasets (UCI ML archive):
KDDCup99 (38 dimensions, 1.8 GB, k=5)
Corel image (32 dimensions, 1.9 GB, k=16)
Reuters text database (258 dimensions, 2 GB, k=25)
Super-sampled and normalized to [0,1].
Performance of k-means and FEKM Algorithms on Synthetic Datasets
(running time in seconds; 20 clusters)

Size    Dim   Iterations   k-means time   FEKM time   Sample (%)   Passes
1.1GB   200   100          54862.33       27388.85    10           2
1.1GB   200   3            1898.65        584.88      5            1
1.1GB   100   100          41029.15       18106.51    10           2
1.1GB   100   3            1233.12        585.63      5            1
1.1GB   50    3            1796.30        882.36      5            1
1.1GB   20    10           5335.15        2112.42     5            2
1.1GB   10    6            3919.08        1643.75     5            1
1.1GB   5     6            4619.95        2353.41     5            1
4.4GB   100   2            4393.02        2931.53     10           1
4.4GB   100   10           21985.62       8194.07     10           1
4.4GB   100   10           21985.62       7467.53     5            1
Performance of k-means and FEKM Algorithms on Real Datasets
(running time in seconds; squared error is between the final centers and the centers after sampling)

Data      Iterations   k-means time   FEKM time   Sample (%)   Passes   Squared Error
KDD99     19           7151           2317        10           2        4.0
KDD99     19           7151           2529        15           2        3.5
KDD99     19           7151           2136        5            2        4.2
Corel     43           28442          10503       10           3        2.2
Corel     43           28442          12603       15           3        2.15
Corel     43           28442          9342        5            3        3.24
Reuters   20           41290          10311       10           2        10.1
Reuters   20           41290          11204       15           2        8.6
Reuters   20           41290          9214        5            2        14.9
Summary of Results
Typically requires only one or a small number of passes over the entire dataset.
Provably produces the same cluster centers as reported by the original k-means algorithm.
Experimental results from a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared to k-means.
Distributed Clustering
Growing number of distributed data repositories.
Downloading or merging data from remote sources is not efficient.
Insufficient work on distributed data mining algorithms.
Existing algorithms are approximate.
Distributed Fast and Exact KMeans (DFEKM)
Client nodes send samples to the central node.
The central node runs KMeans on the sample, builds the CA table, and sends it back to all the nodes.
Client nodes scan their datasets once, keep sufficient statistics of the stable points, keep the boundary points, and send their CA tables to the central node.
The central node merges all the CA tables and recomputes the exact centers.
The central node verifies whether the new centers are within the confidence radius.
If not, it sends new initial centers to all nodes and the whole process repeats (a sketch of the merge-and-verify step follows below).
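A sketch of the coordinator's merge-and-verify step for one CA-table row, under stated assumptions: each client reports the per-cluster (count, linear sum) statistics from its scan, `boundary_pts` pools the boundary points from all clients, and assigning each boundary point to its nearest sampled center is a simplification of the paper's per-row recomputation:

```python
import numpy as np

def merge_and_verify(client_stats, boundary_pts, centers, radii):
    """Merge additive sufficient statistics from all clients, fold in the
    retained boundary points, recompute centers, and test the radius."""
    counts = sum(c for c, _ in client_stats).astype(float)  # per-cluster counts
    sums = sum(s for _, s in client_stats).copy()           # per-cluster linear sums
    for x in boundary_pts:                 # assign retained points explicitly
        j = np.linalg.norm(centers - x, axis=1).argmin()
        counts[j] += 1
        sums[j] += x
    new_centers = sums / np.maximum(counts, 1)[:, None]
    # Accept only if every recomputed center stayed inside its confidence
    # radius; otherwise DFEKM restarts with new_centers as initial centers.
    ok = np.all(np.linalg.norm(new_centers - centers, axis=1) <= radii)
    return new_centers, ok
```

Because the statistics are additive, each client ships only k counts and k vectors per row, which is what keeps the protocol cheap on loosely coupled machines.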
Experiment Design
Performance when the nodes are loosely coupled.
Performance in the presence of imbalanced data.
Comparison with parallel KMeans.
Performance with varying numbers of nodes for the above scenarios.
Varying numbers of clusters and data dimensions.
Synthetic and real datasets.
Parallel KMeans
The central node sends initial centers to all nodes.
At each pass, each node scans its data and collects sufficient statistics (the count, linear sum, and squared sum of its data points).
The central node gathers these statistics, merges them across nodes, recomputes the centers, and sends the new centers to all nodes.
Repeat until the centers no longer move (see the sketch below).
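One pass of this scheme in miniature, with the MPI communication replaced by plain function calls (`local_stats` and `parallel_pass` are illustrative names); the additivity of the statistics is what makes the parallel version exact:

```python
import numpy as np

def local_stats(X_local, centers):
    """Per node: one scan, accumulating count, linear sum, and squared sum."""
    k, dim = centers.shape
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(2).argmin(1)
    n, s, q = np.zeros(k), np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        pts = X_local[labels == j]
        n[j], s[j], q[j] = len(pts), pts.sum(axis=0), (pts ** 2).sum()
    return n, s, q

def parallel_pass(partitions, centers):
    """Central node: gather statistics, merge them, recompute the centers."""
    stats = [local_stats(Xp, centers) for Xp in partitions]  # stands in for an MPI gather
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    # Empty clusters keep their old centers.
    return np.where(n[:, None] > 0, s / np.maximum(n, 1)[:, None], centers)
```

Each pass costs one scan of the largest partition plus a small reduction, which is why both communication delay and data imbalance hurt.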
Parallel KMeans: Pros and Cons
Efficient: scales linearly with the number of nodes.
Performance deteriorates for loosely coupled machines.
Problems also arise with imbalanced data.
Related Research
An implementation of parallel KMeans using MPI by Dhillon and Modha.
It does not consider loosely coupled environments or imbalance in the data.
A new algorithm is required.
Summary of Results
Faster than parallel KMeans when the communication delay is high.
Faster than parallel KMeans when the data is highly imbalanced (50%, 20%, 20%, and 10% across four nodes).
Faster than sequential KMeans on a single node.
Parallel KMeans is faster for 4 nodes when the delay is less than 100s.
Related Work
Motwani et al.: a one-pass approximate algorithm (STREAM) for the k-median clustering problem.
Motwani et al.: an algorithm for maintaining clusters over a sliding window.
J. Han et al.: a framework for dealing with evolving streaming data.
Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK)
Run KMeans on the data in the first window.
Estimate the confidence radius, the cluster radius, and a threshold number of points for each cluster.
In each new window, find the stable, boundary, and outlier points.
Keep sufficient statistics of the stable points.
Keep the boundary and outlier points.
If the number of outliers exceeds the threshold, Case 1 applies: the cluster-center movement is substantial. Otherwise, Case 2 applies: the movement is within the confidence radius.
EADSK:
Case 1: runs KMeans on the complete dataset.
Case 2: runs a weighted KMeans on the saved boundary and outlier points together with the sufficient statistics of the stable points.
AADSK:
Case 1: runs weighted KMeans on the outlier and boundary points.
Case 2: runs weighted KMeans on the stable and boundary points.
(A control-flow sketch of this dispatch follows below.)
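A control-flow sketch of the dispatch above, under stated assumptions: `classify_window` is a hypothetical helper that splits a window into stable-point summaries (per-cluster mean and count), boundary points, and outliers; `kmeans` is the earlier sketch; and the weighted variant below treats a summary as a point weighted by its count:

```python
import numpy as np

def weighted_kmeans(weighted_pts, k, iters=50, seed=0):
    """Lloyd iterations over (point, weight) pairs; a stable-cluster summary
    enters as (mean, count), a raw point as (point, 1.0)."""
    P = np.array([p for p, _ in weighted_pts], dtype=float)
    w = np.array([wt for _, wt in weighted_pts], dtype=float)
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(iters):
        lab = ((P[:, None, :] - C[None, :, :]) ** 2).sum(2).argmin(1)
        for j in range(k):
            m = lab == j
            if w[m].sum() > 0:
                C[j] = (w[m][:, None] * P[m]).sum(0) / w[m].sum()
    return C

def process_window(window, state, exact):
    """One window of EADSK (exact=True) or AADSK (exact=False)."""
    summaries, boundary, outliers = classify_window(window, state)  # assumed helper
    if len(outliers) > state.outlier_threshold:
        # Case 1: cluster-center movement is substantial.
        if exact:
            state.centers = kmeans(window, state.k)[0]   # EADSK: full rerun
        else:                                            # AADSK: boundary + outliers
            state.centers = weighted_kmeans(
                [(p, 1.0) for p in boundary + outliers], state.k)
    else:
        # Case 2: movement stays within the confidence radius.
        pts = list(summaries) + [(p, 1.0) for p in boundary]
        if exact:
            pts += [(p, 1.0) for p in outliers]          # EADSK keeps outliers too
        state.centers = weighted_kmeans(pts, state.k)
    return state
```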
Exact and Approximate Stream KMeans (FESK and FASK)
These two algorithms do not use outliers.
In the initial window, estimate the confidence radius.
In each subsequent window:
Find the stable and boundary points.
Detect whether the centers move outside the confidence region.
If they do, FESK runs KMeans on the complete dataset, while FASK runs the weighted KMeans algorithm on the stable and boundary points.
If they do not, both FESK and FASK have computed the exact centers, the same as KMeans (see the sketch below).
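The corresponding per-window flow, reusing the `kmeans` and `weighted_kmeans` sketches above; `classify_window_no_outliers` is a hypothetical helper that returns only stable-point summaries and boundary points:

```python
import numpy as np

def fesk_fask_window(window, state, approximate):
    """One window of FESK (approximate=False) or FASK (approximate=True)."""
    summaries, boundary = classify_window_no_outliers(window, state)  # assumed helper
    # Candidate centers from the stable summaries plus boundary points.
    pts = list(summaries) + [(p, 1.0) for p in boundary]
    candidate = weighted_kmeans(pts, state.k)
    moved_out = np.any(
        np.linalg.norm(candidate - state.centers, axis=1) > state.conf_radius)
    if not moved_out:
        state.centers = candidate   # provably the same centers as KMeans
    elif approximate:
        state.centers = candidate   # FASK accepts the weighted approximation
    else:
        state.centers = kmeans(window, state.k)[0]   # FESK: full rerun
    return state
```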
[Figure: comparison of KMeans, FESK, and FASK on the C5d20 dataset, with a small number of points per window and the centers moving within the confidence region.]
Characteristics of the Algorithms
All four algorithms perform better than multi-pass KMeans when the centers do not move outside the confidence region.
FESK and EADSK always find the exact centers, the same as KMeans.
FESK and EADSK perform no better than KMeans if the centers move outside the confidence region in any window.
FASK and AADSK are much faster than KMeans in that case.
AADSK computes better results than FASK for evolving streaming data.
Conclusion
Efficient variants of the KMeans algorithm for out-of-core, distributed, and streaming data.
The evolving nature of streaming data is considered.
Streaming-data clustering is integrated with change detection.