06/03/15
Efficient Clustering Algorithms for
Out-of-Core, Distributed and
Streaming Data
Anjan Goswami
The Ohio State University
A World Immersed in Data!
Huge amount of data.
Distributed Data.
Streaming Data.
EXAMPLES
• Business Data
- Wal-Mart (20M transactions per day)
- AT&T (300M calls per day)
- Mobil oil exploration data (100TB)
• Satellite and Sensor Data
- NASA, EOS project: 50 GB per hour
- Sloan Digital Sky Survey (SDSS): 300M celestial objects, DR1 contains 1 TB of catalogs
• Scientific Simulation
- BioSimGrid (10-40TB trajectory data)
• Biology Data
- GenBank (>30 billion base pairs, >30 million sequences, 2003)
• Web logs: Amazon, eBay, MSN
Some Challenges
Massive Data (out of core).
Distributed and Possibly imbalanced.
Streaming Data.
Overview of My Research
Out-of-Core Datasets
Implementation and Evaluation of Fast and
Exact KMeans Algorithm [ICDM’04]
Distributed Datasets
Distributed Fast and Exact KMeans Algorithm
[Under Submission]
Streaming Data
Two exact algorithms and two approximate
algorithms for clustering evolving streaming
data.
[Under Preparation]
Overview of the Talk
K Center Clustering Definition
KMeans Algorithm
Out of Core Data: FEKM and its
Evaluation
Distributed Data: DFEKM
Streaming Data: Four Algorithms.
K Center Clustering Problem
Input: A set of data points.
Output: K centers such that the sum of squared
distances from each point to its closest center
is minimized.
This problem has been proven NP-hard.
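As a concrete reading of the objective above, the sketch below computes the sum of squared distances from each point to its closest center. The function name `clustering_cost` and the toy data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def clustering_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    # diff[i, j] is the vector from center j to point i.
    diff = points[:, None, :] - centers[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Each point contributes only the distance to its nearest center.
    return float(sq_dist.min(axis=1).sum())

# Toy data: two points near (0, 1) and one point exactly on the second center.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])
cost = clustering_cost(points, centers)
```

The K-center search itself is the hard part; this function only evaluates a candidate set of centers.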
K-means Algorithm
Developed by Hartigan in 1967.
Proven to converge to a local minimum (Bottou and Bengio, 1995).
Shown to be a special case of Newton's steepest descent (Cheng, 1995).
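For reference, a minimal in-memory sketch of the standard k-means iteration (assign each point to its nearest center, then move each center to the mean of its points) might look as follows. The function name, the `max_iter` parameter, and the empty-cluster rule are assumptions made for illustration.

```python
import numpy as np

def kmeans(points, init_centers, max_iter=100):
    """Plain k-means: alternate assignment and mean update until centers stop moving."""
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: an empty cluster keeps its previous center (an assumption).
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final labels with respect to the returned centers.
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dist.argmin(axis=1)
```

Every iteration touches every point, which is why the number of passes dominates the cost once the data no longer fits in memory.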
Variants of Fast K-means Algorithms
• Pelleg and Moore: use of a kd-tree. [in memory]
• Bradley and Fayyad: iterative sampling and compression. [out of core, approx]
• Farnstrom: simplified the previous approach.
• Domingos and Hulten: use of the Hoeffding inequality for successive sampling. [sampling]
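To illustrate the kind of guarantee the Hoeffding inequality gives for sampling, the sketch below evaluates the generic one-sided bound for the mean of n independent values with range R: with probability at least 1 - delta, the sample mean exceeds the true mean by no more than R * sqrt(ln(1/delta) / (2n)). This is the textbook bound only; how Domingos and Hulten apply it to k-means is not shown in the slides, and the function names are assumptions.

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Error bound for the mean of n samples, holding with probability >= 1 - delta."""
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def samples_needed(value_range, epsilon, delta):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Example: values normalized to [0, 1], error 0.05, failure probability 0.05.
n = samples_needed(1.0, 0.05, 0.05)
```

The bound shrinks only with the square root of n, so halving the error takes four times as many samples.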
Roadmap of Presentation
• Clustering Algorithms.
- Out of Core data: FEKM, Evaluation of FEKM.
- Distributed data.
- Streaming data.
Motivation for FEKM
Existing algorithms give approximate
cluster centers.
No exact clustering algorithms exist for
out-of-core data.
FEKM Concepts
Sampling the datasets
K-means clustering on samples
Confidence Radius
An estimate of the upper bound on the distance between the
sample center and the corresponding k-means center.
Boundary Points
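d1, d2: distances from a point to the centers C1 and C2, which have confidence radii δ1 and δ2.
|d1-d2| < δ1+δ2: another center is close
enough compared to the closest center, so the point is a boundary point.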
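The boundary test above can be sketched for K centers as follows. The slide shows only two centers; comparing the closest center against every other center is an assumption made here for the general case, as are the function and parameter names.

```python
import numpy as np

def is_boundary(point, centers, radii):
    """True if some other center is close enough to compete with the closest one.

    centers: (K, d) sample centers; radii: (K,) confidence radius of each center.
    """
    dist = np.linalg.norm(centers - point, axis=1)
    nearest = dist.argmin()
    # Gap between every other center and the nearest one: d_j - d_nearest.
    gap = np.delete(dist, nearest) - dist[nearest]
    # Combined uncertainty of the two centers being compared.
    slack = radii[nearest] + np.delete(radii, nearest)
    return bool(np.any(gap < slack))

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
radii = np.array([1.0, 1.0])
stable_case = is_boundary(np.array([1.0, 0.0]), centers, radii)    # gap 8 vs slack 2
boundary_case = is_boundary(np.array([4.5, 0.0]), centers, radii)  # gap 1 vs slack 2
```

A point that fails this test keeps the same owner for any true centers lying within the confidence radii, which is what allows it to be summarized instead of stored.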
Cluster Abstract Table
Basic ideas of FEKM
Run KMeans on sample and build CA
table storing centers at each iteration.
Estimate confidence radius from the
sample.
In one scan classify the data points as
stable and boundary points for each row of
the table.
In the same scan, store sufficient statistics of the stable
points and keep the boundary points.
After the scan, recompute the centers at
each row with the new information.
Verify whether the recomputed centers are within the confidence
radius.
If the centers are within the confidence radius, they are the
exact centers that the KMeans algorithm would compute.
If the centers are not within the confidence
radius, use them as new initial centers
and repeat the whole process.
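The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: it uses one scalar confidence radius for all centers (the slides do not say how the radius is estimated), holds the data in memory instead of streaming it from disk, and all function names are hypothetical.

```python
import numpy as np

def nearest(points, centers):
    """Index of the nearest center for every point."""
    return np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)

def kmeans_trace(points, init, n_iter):
    """Plain k-means; returns the centers at every iteration (the CA table rows)."""
    rows = [init.astype(float).copy()]
    for _ in range(n_iter):
        labels, new = nearest(points, rows[-1]), rows[-1].copy()
        for j in range(len(new)):
            if np.any(labels == j):      # an empty cluster keeps its center
                new[j] = points[labels == j].mean(axis=0)
        rows.append(new)
    return rows

def fekm_scan(data, rows, radius):
    """One scan over the data, then a replay of the CA table with exact centers.

    Returns (centers, ok). ok is False when an exact center left the confidence
    radius; the returned centers are then the new initial centers.
    """
    k, dim = rows[0].shape
    stats = []
    for c in rows:  # conceptually one pass: each point is tested against every row
        d = np.linalg.norm(data[:, None] - c[None], axis=2)
        two = np.partition(d, 1, axis=1)           # two smallest distances per point
        boundary = (two[:, 1] - two[:, 0]) <= 2 * radius
        labels = d.argmin(axis=1)
        sums, counts = np.zeros((k, dim)), np.zeros(k)
        np.add.at(sums, labels[~boundary], data[~boundary])   # stable points only
        np.add.at(counts, labels[~boundary], 1)
        stats.append((sums, counts, data[boundary]))
    exact = rows[0].copy()
    for c, (sums, counts, bpts) in zip(rows, stats):
        if np.any(np.linalg.norm(exact - c, axis=1) > radius):
            return exact, False
        s, n = sums.copy(), counts.copy()
        if len(bpts):                               # boundary points are reassigned
            lab = nearest(bpts, exact)              # using the exact centers
            np.add.at(s, lab, bpts)
            np.add.at(n, lab, 1)
        exact = np.where(n[:, None] > 0, s / np.maximum(n, 1)[:, None], exact)
    return exact, True
```

With a sample trace `rows = kmeans_trace(sample, init, m)`, a successful `fekm_scan(data, rows, radius)` returns the centers that m + 1 iterations of plain k-means on the full data would produce from the same initial centers, because every stable point keeps its owner as long as the exact centers stay within the radius.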
Discussion
Correctness
FEKM is guaranteed to find the same
clusters as the original k-means.
Performance Analysis
Determined by the number of passes over
the dataset.
Experimental Setup and Datasets
Machines
700 MHz Pentium processors and 1 GB memory
Synthetic Datasets
Similar to the ones used by Bradley et al.
18 datasets of 1.1 GB and 2 datasets of 4.4 GB
5, 10, 20, 50, 100, and 200 dimensions
5, 10, and 20 clusters
Real Datasets (UCI ML archive)
KDDCup99 (38 dim, 1.8 GB, k=5)
Corel image (32 dim, 1.9 GB, k=16)
Reuters text database (258 dim, 2 GB, k=25)
Super-sampled and normalized to [0, 1]
Performance of k-means and FEKM
Algorithms on Synthetic Datasets
Running time in seconds, 20 clusters
Size    Dimensions  Iterations  k-means (s)  FEKM (s)   Sample (%)  Passes
1.1GB   200         100         54862.33     27388.85   10          2
1.1GB   200         3           1898.65      584.88     5           1
1.1GB   100         100         41029.15     18106.51   10          2
1.1GB   100         3           1233.12      585.63     5           1
1.1GB   50          3           1796.30      882.36     5           1
1.1GB   20          10          5335.15      2112.42    5           2
1.1GB   10          6           3919.08      1643.75    5           1
1.1GB   5           6           4619.95      2353.41    5           1
4.4GB   100         2           4393.02      2931.53    10          1
4.4GB   100         10          21985.62     8194.07    10          1
4.4GB   100         10          21985.62     7467.53    5           1
Performance of k-means and FEKM
Algorithms on Real Datasets
Running time in seconds; squared error between the final centers
and the centers after sampling
Data     Iterations  k-means (s)  FEKM (s)  Sample (%)  Passes  Squared error
KDD99    19          7151         2317      10          2       4.0
KDD99    19          7151         2529      15          2       3.5
KDD99    19          7151         2136      5           2       4.2
Corel    43          28442        10503     10          3       2.2
Corel    43          28442        12603     15          3       2.15
Corel    43          28442        9342      5           3       3.24
Reuters  20          41290        10311     10          2       10.1
Reuters  20          41290        11204     15          2       8.6
Reuters  20          41290        9214      5           2       14.9
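The speedup on the real datasets follows directly from the two timing columns. The short calculation below divides the k-means time by the FEKM time for each reported run; the variable names are illustrative, and the numbers are copied from the table above.

```python
# (dataset, sample %, k-means seconds, FEKM seconds), as reported in the table.
runs = [
    ("KDD99", 10, 7151, 2317), ("KDD99", 15, 7151, 2529), ("KDD99", 5, 7151, 2136),
    ("Corel", 10, 28442, 10503), ("Corel", 15, 28442, 12603), ("Corel", 5, 28442, 9342),
    ("Reuters", 10, 41290, 10311), ("Reuters", 15, 41290, 11204), ("Reuters", 5, 41290, 9214),
]

# Speedup = time of k-means / time of FEKM for the same dataset.
speedups = {(name, pct): t_kmeans / t_fekm for name, pct, t_kmeans, t_fekm in runs}
lowest = min(speedups.values())
highest = max(speedups.values())
best_run = max(speedups, key=speedups.get)
```

The largest ratio comes from the Reuters run with a 5% sample and the smallest from the Corel run with a 15% sample.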
Summary of Results
Typically requires only one or a small number of
passes on the entire dataset
Provably produces the same cluster centers as
reported by the original k-means algorithm
Experimental results from a number of real and
synthetic datasets show speedups between a
factor of 2 and 4.5, as compared to k-means
Roadmap of Presentation
• Clustering Algorithms.
- Out of Core data: FEKM, Evaluation of FEKM.
- Distributed data.
- Streaming data.
Distributed Clustering
Growing number of distributed data
repositories.
Downloading or merging data from remote
sources is not efficient.
Little work exists on distributed data
mining algorithms.
Existing algorithms are approximate.
DFEKM Idea
Extend Concepts of FEKM.
Communicate CA Tables.
Use stable and boundary points and
confidence radius as in FEKM.
DFEKM Scheme
Distributed Fast and Exact KMeans
(DFEKM)
Client nodes:
- send samples to the central node.
The central node:
- runs KMeans on the sample,
- builds the CA table,
- sends it back to all the nodes.
Client nodes:
- scan their datasets once,
- keep sufficient statistics of the stable points,
- keep the boundary points,
- send their CA tables to the central node.
The central node:
- merges all CA tables,
- recomputes the exact centers,
- verifies whether the new centers are within the
confidence radius.
If not, the central node sends the new initial
centers to all nodes and the whole
process repeats.
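A minimal sketch of the merge step at the central node, for one row of the CA table, might look like this. It assumes each client reports per-cluster sums and counts of its stable points plus its raw boundary points, and that a single scalar confidence radius is used; the report layout and function names are hypothetical.

```python
import numpy as np

def merge_and_recompute(node_reports, centers):
    """Combine per-node reports and recompute the centers for one CA-table row.

    node_reports: list of (sums, counts, boundary_points), one tuple per client.
    centers: (K, d) exact centers used to assign the boundary points.
    """
    k, dim = centers.shape
    total_sums, total_counts = np.zeros((k, dim)), np.zeros(k)
    for sums, counts, boundary in node_reports:
        # Stable points arrive already summarized by the client.
        total_sums += sums
        total_counts += counts
        if len(boundary):
            # Boundary points are assigned centrally, using the exact centers.
            dist = np.linalg.norm(boundary[:, None] - centers[None], axis=2)
            labels = dist.argmin(axis=1)
            np.add.at(total_sums, labels, boundary)
            np.add.at(total_counts, labels, 1)
    new_centers = centers.copy()
    filled = total_counts > 0   # an empty cluster keeps its previous center
    new_centers[filled] = total_sums[filled] / total_counts[filled][:, None]
    return new_centers

def within_radius(new_centers, sample_centers, radius):
    """Check that every recomputed center stays inside the confidence radius."""
    return bool(np.all(np.linalg.norm(new_centers - sample_centers, axis=1) <= radius))
```

Only summaries and boundary points cross the network, so the traffic per node does not grow with the number of stable points it holds.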
Experiment Design
Performance when the nodes are loosely
coupled.
Performance in the presence of imbalanced data.
Comparison with parallel KMeans.
Performance with a varying number of nodes for
the above scenarios.
Varying number of clusters and data dimensions.
Synthetic and real data sets.
Parallel KMeans
The central node sends initial centers to all nodes.
At each pass, each node scans its data and
collects the sufficient statistics (number, linear
sum and square sum of the data points).
The central node:
- gathers and merges the statistics from all the nodes,
- recomputes the centers,
- sends the new centers to all other nodes.
Repeat until the centers do not move.
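The protocol above can be simulated in a single process, with each "node" represented by its own array of points. This is only a sketch of the data flow (no MPI or networking), and the function names are assumptions.

```python
import numpy as np

def node_statistics(local_points, centers):
    """Per-cluster count, linear sum and square sum for one node's data."""
    k, dim = centers.shape
    dist = np.linalg.norm(local_points[:, None] - centers[None], axis=2)
    labels = dist.argmin(axis=1)
    counts = np.bincount(labels, minlength=k)
    linear = np.zeros((k, dim))
    np.add.at(linear, labels, local_points)
    # The square sum is not needed to move the centers; it lets the central
    # node derive the squared error without seeing the points.
    square = np.zeros(k)
    np.add.at(square, labels, (local_points ** 2).sum(axis=1))
    return counts, linear, square

def central_update(stats, centers):
    """Merge the statistics of all nodes and recompute the centers."""
    counts = sum(s[0] for s in stats)
    linear = sum(s[1] for s in stats)
    new_centers = centers.copy()
    filled = counts > 0   # an empty cluster keeps its previous center
    new_centers[filled] = linear[filled] / counts[filled][:, None]
    return new_centers

def parallel_kmeans(partitions, init_centers, max_iter=100):
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        # One pass: every node scans its partition against the current centers.
        stats = [node_statistics(p, centers) for p in partitions]
        new_centers = central_update(stats, centers)
        moved = not np.allclose(new_centers, centers)
        centers = new_centers
        if not moved:
            break
    return centers
```

Each iteration costs one round of communication, so the total delay grows with the number of iterations, and the slowest node (the one holding the most data) sets the pace of every round.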
Parallel KMeans Pros and Cons
Efficient: scales linearly with the
number of nodes.
Performance deteriorates for loosely
coupled machines.
Problems arise with imbalanced data.
Related Research
An implementation of parallel KMeans
using MPI by Dhillon and Modha.
This does not consider a loosely coupled
environment or imbalance in the data.
A new algorithm is required.
Remedy
Run parallel KMeans on the remote nodes.
Or download the data to a central
node and run the KMeans algorithm there.
Experimental Setup and Datasets
Machines
IA-32 cluster, 2 × 900 MHz Itanium-2 processors
4 GB main memory
Myrinet switch, 2 Gb/s
Datasets
Synthetic Datasets
Similar to the ones used by Bradley et al.
4 datasets of 1.1 GB
10 and 100 dimensions
5 and 20 clusters
Real Datasets (UCI ML archive)
KDDCup99 (38 dim, 1.8 GB, k=5)
Corel image (32 dim, 1.9 GB, k=16)
Super-sampled and normalized to [0, 1]
Comparison with Parallel KMeans with
increasing delay in communication
between nodes.
Comparison with parallel KMeans in the
presence of imbalanced data.
Summary of Results
Faster than parallel KMeans when the delay in
communication is higher.
Faster than parallel KMeans when the data is
highly imbalanced
(50%, 20%, 20% and 10%).
Faster than KMeans on a single node
(sequential).
Parallel KMeans is faster for 4 nodes with a
delay of less than 100s.
Roadmap of Presentation
• Clustering Algorithms.
- Out of Core data: FEKM, Evaluation of FEKM.
- Distributed data.
- Streaming data.
Challenges
An online, or preferably one-pass,
algorithm is required.
It should be able to deal with the evolving
nature of streaming data.
Related Work
Motwani: one-pass approximate
algorithm (Stream) for the k-median clustering
problem.
Motwani: algorithm for maintaining
clusters in a sliding window.
J. Han: a framework to deal with evolving
streaming data.
Evolving Clusters in Sliding Window: Little
Movement of Centers
Evolving Clusters in Sliding Window:
Arbitrary Movements of Cluster Centers
Four Algorithms
FESK: Fast and Exact Stream KMeans
FASK: Fast and Approximate Stream KMeans
EADSK: Exact Adaptive Stream KMeans
AADSK: Approximate Adaptive Stream KMeans
Four Algorithms … Idea
Stream KMeans Concepts
EADSK and AADSK
Exact and Approximate Adaptive Stream
KMeans (EADSK and AADSK)
Run KMeans on the data in first window.
Estimate confidence radius, cluster
radius, threshold number of points at each
cluster.
In New Window: Find stable, boundary
and outlier points.
Keep sufficient statistics of the stable
points.
Keep the boundary and outlier
points.
If the number of outliers > the threshold:
Case 1: cluster center movement
is substantial.
Otherwise: Case 2: movement is
within the confidence radius.
EADSK
Case 1: runs KMeans on the complete data set.
Case 2: runs a weighted KMeans on the saved
boundary and outlier points, using the
sufficient statistics of the stable points.
AADSK
Case 1: runs weighted KMeans with the outlier and
boundary points.
Case 2: runs weighted KMeans with the stable and
boundary points.
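A weighted KMeans of the kind referred to above can be sketched as follows. Here each group of stable points is assumed to be represented by its centroid with a weight equal to its count, while boundary and outlier points carry weight 1; that encoding and the function name are assumptions, since the slides do not spell them out.

```python
import numpy as np

def weighted_kmeans(points, weights, init_centers, max_iter=100):
    """k-means in which every point counts `weight` times in the mean update."""
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        dist = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            mask = labels == j
            # A cluster with no weight assigned keeps its previous center.
            if weights[mask].sum() > 0:
                new_centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    return centers

# Nine stable points summarized at (0, 0), one boundary point at (1, 0),
# and five stable points summarized at (10, 0).
summary = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
weights = np.array([9.0, 1.0, 5.0])
centers = weighted_kmeans(summary, weights, np.array([[0.0, 0.0], [10.0, 0.0]]))
```

Because a summarized group moves as a single unit, the result matches KMeans on the raw points only while every stable point keeps its cluster.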
FESK and FASK
Exact and Approximate Stream
KMeans (FESK and FASK)
These two algorithms do not use outliers.
At the initial window: estimate the confidence radius.
At the next window:
- Find stable and boundary points.
- Detect whether the centers move outside the confidence
region.
- If true: FESK runs KMeans on the complete data set;
FASK runs the weighted KMeans algorithm on the stable
and boundary points.
- If false: both FESK and FASK have computed the
same exact centers as KMeans.
Experiments and Evaluations
Experiments with Evolving Clusters
On very large data sets.
On smaller data at each window.
Experimental Setup and Datasets
Machines
700 MHz Pentium processors
1 GB memory
Datasets
Synthetic Datasets
Similar to the ones used by Bradley et al.
6 datasets of 2.2 GB and 2 small datasets
5, 20, and 100 dimensions
5 and 20 clusters
Real Datasets (UCI ML archive)
KDDCup99 (38 dim, 1.8 GB, k=5)
Corel image (32 dim, 1.9 GB, k=16)
Super-sampled and normalized to [0, 1]
Comparison of KMeans, FESK and FASK
for a Very Large Data Set (2.2GB)
Comparison of KMeans with EADSK and
AADSK for a Very Large Data Set (2.2GB)
Comparison of KMeans, FESK and FASK with the c5d20 dataset, with a
small number of points in a window and centers moving within the
confidence region
Comparison with EADSK and
AADSK with the c5d20 dataset
Centers are moving outside the
confidence region in every window
Characteristics of the Algorithms
All four algorithms perform better than multi-pass
KMeans when the centers do not move
outside the confidence region.
FESK and EADSK always find the same exact centers as
KMeans.
FESK and EADSK perform no better than
KMeans if the centers move outside the confidence
region in any window.
FASK and AADSK are much faster than
KMeans in this case.
AADSK computes a better result than
FASK for evolving streaming data.
Conclusion
Efficient variants of the KMeans algorithm for
out-of-core, distributed and streaming
data.
The evolving nature of streaming data is
considered.
Streaming data clustering is integrated
with change detection.
Questions?

More Related Content

What's hot

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopExtremeEarth
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetJ On The Beach
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefRobert Grossman
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...IRJET Journal
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreducemakoto onizuka
 
Stacks
StacksStacks
StacksAcad
 
CourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkCourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkDataWorks Summit
 
ExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pRobert Grossman
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins Edureka!
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
Slide 1
Slide 1Slide 1
Slide 1butest
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
 

What's hot (20)

Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Hopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open WorkshopHopsworks - ExtremeEarth Open Workshop
Hopsworks - ExtremeEarth Open Workshop
 
Mining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert BifetMining big data streams with APACHE SAMOA by Albert Bifet
Mining big data streams with APACHE SAMOA by Albert Bifet
 
CMPE275-Project1Report
CMPE275-Project1ReportCMPE275-Project1Report
CMPE275-Project1Report
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
CLIM Program: Remote Sensing Workshop, High Performance Computing and Spatial...
 
Optimization for iterative queries on Mapreduce
Optimization for iterative queries on MapreduceOptimization for iterative queries on Mapreduce
Optimization for iterative queries on Mapreduce
 
Stacks
StacksStacks
Stacks
 
CourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkCourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on Spark
 
ExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and AchievementsExtremeEarth Open Workshop - Overview and Achievements
ExtremeEarth Open Workshop - Overview and Achievements
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Bioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9pBioclouds CAMDA (Robert Grossman) 09-v9p
Bioclouds CAMDA (Robert Grossman) 09-v9p
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
cnsm2011_slide
cnsm2011_slidecnsm2011_slide
cnsm2011_slide
 
Slide 1
Slide 1Slide 1
Slide 1
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 

Similar to Clustering

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchGreg Makowski
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale SupercomputerSagar Dolas
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Economakis
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Oikonomakis
 
Clustering
ClusteringClustering
ClusteringMeme Hei
 
Cross-Validation and Big Data Partitioning Via Experimental Design
Cross-Validation and Big Data Partitioning Via Experimental DesignCross-Validation and Big Data Partitioning Via Experimental Design
Cross-Validation and Big Data Partitioning Via Experimental Designdans_salford
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиSQALab
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on MulticoreMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicoreillidan2004
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELA TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELJenny Liu
 
Large Scale Kernel Learning using Block Coordinate Descent
Large Scale Kernel Learning using Block Coordinate DescentLarge Scale Kernel Learning using Block Coordinate Descent
Large Scale Kernel Learning using Block Coordinate DescentShaleen Kumar Gupta
 

Similar to Clustering (20)

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
Programmable Exascale Supercomputer
Programmable Exascale SupercomputerProgrammable Exascale Supercomputer
Programmable Exascale Supercomputer
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Clustering
ClusteringClustering
Clustering
 
H235055
H235055H235055
H235055
 
Cross-Validation and Big Data Partitioning Via Experimental Design
Cross-Validation and Big Data Partitioning Via Experimental DesignCross-Validation and Big Data Partitioning Via Experimental Design
Cross-Validation and Big Data Partitioning Via Experimental Design
 
Как приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практикиКак приготовить тестовые данные для Big Data проекта. Пример из практики
Как приготовить тестовые данные для Big Data проекта. Пример из практики
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
MaPU-HPCA2016
MaPU-HPCA2016MaPU-HPCA2016
MaPU-HPCA2016
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Map-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on MulticoreMap-Reduce for Machine Learning on Multicore
Map-Reduce for Machine Learning on Multicore
 
Real Time Geodemographics
Real Time GeodemographicsReal Time Geodemographics
Real Time Geodemographics
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLELA TALE of DATA PATTERN DISCOVERY IN PARALLEL
A TALE of DATA PATTERN DISCOVERY IN PARALLEL
 
Large Scale Kernel Learning using Block Coordinate Descent
Large Scale Kernel Learning using Block Coordinate DescentLarge Scale Kernel Learning using Block Coordinate Descent
Large Scale Kernel Learning using Block Coordinate Descent
 

More from Anjan Goswami

Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}Anjan Goswami
 
Discovery In Commerce Search
Discovery In Commerce SearchDiscovery In Commerce Search
Discovery In Commerce SearchAnjan Goswami
 
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...Anjan Goswami
 
Controlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce SearchControlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce SearchAnjan Goswami
 
Spelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsSpelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsAnjan Goswami
 
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...Anjan Goswami
 
Assessing product image quality for online shopping
Assessing product image quality for online shoppingAssessing product image quality for online shopping
Assessing product image quality for online shopping Anjan Goswami
 

More from Anjan Goswami (8)

Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
 
Discovery In Commerce Search
Discovery In Commerce SearchDiscovery In Commerce Search
Discovery In Commerce Search
 
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
 
Controlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce SearchControlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce Search
 
Spelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platformsSpelling correction systems for e-commerce platforms
Spelling correction systems for e-commerce platforms
 
Reputation systems
Reputation systemsReputation systems
Reputation systems
 
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
 
Assessing product image quality for online shopping
Assessing product image quality for online shoppingAssessing product image quality for online shopping
Assessing product image quality for online shopping
 

Clustering

  • 1. 06/03/15 Efficient Clustering Algorithms for Out-of-Core, Distributed and Streaming Data Anjan Goswami The Ohio State University
  • 2. 06/03/15 A World Immersed in Data! Huge amount of data. Distributed data. Streaming data. EXAMPLES: Business data: Wal-Mart (20M transactions per day), AT&T (300M calls per day), Mobil oil exploration data (100 TB). Satellite and sensor data: NASA EOS project (50 GB per hour), Sloan Digital Sky Survey (SDSS) (300M celestial objects; DR1 contains 1 TB of catalogs). Scientific simulation: BioSimGrid (10-40 TB trajectory data). Biology data: GenBank (>30 billion base pairs, >30 million sequences, 2003). Web logs: Amazon, eBay, MSN.
  • 3. 06/03/15 Some Challenges Massive Data (out of core). Distributed and Possibly imbalanced. Streaming Data.
  • 4. 06/03/15 Overview of My Research Out-of-Core Datasets Implementation and Evaluation of Fast and Exact KMeans Algorithm [ICDM’04] Distributed Datasets Distributed Fast and Exact KMeans Algorithm [Under Submission] Streaming Data  Two exact algorithms and two approximate algorithms for clustering evolving streaming data. [under preparation]
  • 5. 06/03/15 Overview of the Talk K Center Clustering Definition. KMeans Algorithm. Out of Core Data: FEKM and its Evaluation. Distributed Data: DFEKM. Streaming Data: Four Algorithms.
  • 6. 06/03/15 K Center Clustering Problem Input: A set of data points. Output: K centers such that the sum of squared distances from each point to its closest center is minimized. This problem has been proven NP-hard.
  • 7. 06/03/15 K-means Algorithm Developed by Hartigan in 1967. Proven to converge to local minima. Bottou/Bengio 1995. Proven to be a special case of Newton’s steepest descent. Cheng 1995.
  • 8. 06/03/15 Variants of Fast K-means Algorithms • Pelleg and Moore: use of kd-tree [in memory] • Bradley and Fayyad: iterative sampling and compression. [out of core, approx] • Farnstrom: simplified the previous approach. • Domingos and Hulten: use the Hoeffding inequality for successive sampling. [sampling]
  • 9. 06/03/15 Roadmap of Presentation  Clustering Algorithms. - Out of Core data. FEKM Evaluation of FEKM - Distributed data. - Streaming data.
  • 10. 06/03/15 Motivation to FEKM Existing algorithms give approximate cluster centers. No exact clustering algorithms for out of core data.
  • 13. 06/03/15 K-means clustering on samples: Confidence Radius An estimate of the upper bound on the distance between the sample center and the corresponding k-means center!
  • 14. 06/03/15 Boundary Points d1 d2 C1,δ1 C2,δ2 |d1-d2|<δ1+δ2 Another center is close enough compared to the closest center!
  • 18. 06/03/15 Basic ideas of FEKM Run KMeans on sample and build CA table storing centers at each iteration. Estimate confidence radius from the sample. In one scan classify the data points as stable and boundary points for each row of the table.
  • 19. 06/03/15 Basic Idea of FEKM Store sufficient statistics of core points and the boundary points.  After one scan, recompute centers at each row with the new information.
  • 20. 06/03/15 Basic Idea of FEKM Verify whether the centers are within the confidence radius. If the centers are within the confidence radius: FEKM has computed the same exact centers as the KMeans algorithm. If the centers are not within the confidence radius: use these as new initial centers and repeat the whole process.
  • 21. 06/03/15 Discussion Correctness: FEKM is guaranteed to find the same clusters as the original k-means. Performance analysis: determined by the number of passes over the dataset.
  • 22. 06/03/15 Experimental Setup and Datasets Machines: 700 MHz Pentium processors and 1 GB memory. Synthetic datasets: similar to the ones used by Bradley et al.; 18 datasets of 1.1 GB and 2 datasets of 4.4 GB; 5, 10, 20, 50, 100, and 200 dimensions; 5, 10 and 20 clusters. Real datasets (UCI ML archive): KDDCup99 (38 dim, 1.8 GB, k=5), Corel image (32 dim, 1.9 GB, k=16), Reuters text database (258 dim, 2 GB, k=25). Super-sampling, normalized [0,1].
  • 23. 06/03/15 Performance of k-means and FEKM Algorithms on Synthetic Datasets (running time in seconds, 20 clusters)
      Data size  Dimensions  No. iterations  Time of k-means  Time of FEKM  Samples (%)  Passes
      1.1GB      200         100             54862.33         27388.85      10           2
      1.1GB      200         3               1898.65          584.88        5            1
      1.1GB      100         100             41029.15         18106.51      10           2
      1.1GB      100         3               1233.12          585.63        5            1
      1.1GB      50          3               1796.30          882.36        5            1
      1.1GB      20          10              5335.15          2112.42       5            2
      1.1GB      10          6               3919.08          1643.75       5            1
      1.1GB      5           6               4619.95          2353.41       5            1
      4.4GB      100         2               4393.02          2931.53       10           1
      4.4GB      100         10              21985.62         8194.07       10           1
      4.4GB      100         10              21985.62         7467.53       5            1
  • 24. 06/03/15 Performance of k-means and FEKM Algorithms on Real Datasets (running time in seconds; squared error is between the final centers and the centers after sampling)
      Data     No. iterations  Time of k-means  Time of FEKM  Samples (%)  Passes  Squared error
      KDD99    19              7151             2317          10           2       4.0
      KDD99    19              7151             2529          15           2       3.5
      KDD99    19              7151             2136          5            2       4.2
      Corel    43              28442            10503         10           3       2.2
      Corel    43              28442            12603         15           3       2.15
      Corel    43              28442            9342          5            3       3.24
      Reuters  20              41290            10311         10           2       10.1
      Reuters  20              41290            11204         15           2       8.6
      Reuters  20              41290            9214          5            2       14.9
  • 25. 06/03/15 Summary of Results Typically requires only one or a small number of passes on the entire dataset Provably produces the same cluster centers as reported by the original k-means algorithm Experimental results from a number of real and synthetic datasets show speedups between a factor of 2 and 4.5, as compared to k-means
  • 26. 06/03/15 Roadmap of Presentation Clustering Algorithms. - Out of Core data FEKM Evaluation of FEKM - Distributed data - Streaming data
  • 27. 06/03/15 Distributed Clustering Growing number of distributed data repositories. Downloading or Merging data from remote sources not efficient. Not sufficient work on Distributed Data Mining algorithms. Existing approximate Algorithms.
  • 28. 06/03/15 DFEKM Idea Extend Concepts of FEKM. Communicate CA Tables. Use stable and boundary points and confidence radius as in FEKM.
  • 30. 06/03/15 Distributed Fast and Exact KMeans (DFEKM) Client nodes: send samples to the central node. The central node: runs KMeans on the sample, builds the CA table, and sends it back to all the nodes. Each client node: scans its dataset once, keeps the sufficient statistics of the stable points, keeps the boundary points, and sends its CA table to the central node.
  • 31. 06/03/15 Distributed Fast and Exact KMeans (DFEKM) Central node: merges all CA tables and recomputes the exact centers. Central node: verifies whether the new centers are within the confidence radius. If not, the central node sends new initial centers to all nodes and the whole process repeats.
  • 32. 06/03/15 Experiment Design Performance when the nodes are loosely coupled. Performance in presence of imbalanced data. Comparison with parallel KMeans. Performance with varying number of nodes for the above scenarios. Varying number of clusters, data dimension.  Synthetic and Real Data Sets.
  • 33. 06/03/15 Parallel KMeans The central node sends initial centers to all nodes. At each pass: each node scans its data and collects the sufficient statistics (number, linear sum and square sum of the data points). The central node: gathers the data, merges the statistics from all the nodes, recomputes the centers, and sends the new centers to all other nodes. Repeat until the centers do not move.
  • 34. 06/03/15 Parallel KMeans Pros and Cons Efficient: scales linearly with the number of nodes. The performance deteriorates for loosely coupled machines. Problems arise with imbalanced data.
  • 35. 06/03/15 Related Research An implementation of parallel KMeans using MPI by Dhillon and Modha. This does not consider loosely coupled environment or imbalance in data. A new algorithm is required.
  • 36. 06/03/15 Remedy Run parallel KMeans on remote nodes. We can download the data to a central node and run the KMeans algorithm.
  • 37. 06/03/15 Experimental Setup and Datasets Machines: IA-32 cluster, two 900 MHz Itanium-2 processors, 4 GB main memory, Myrinet switch (2 Gb/s). Synthetic datasets: similar to the ones used by Bradley et al.; 4 datasets of 1.1 GB; 10 and 100 dimensions; 5 and 20 clusters. Real datasets (UCI ML archive): KDDCup99 (38 dim, 1.8 GB, k=5), Corel image (32 dim, 1.9 GB, k=16). Super-sampling, normalized [0,1].
  • 38. 06/03/15 Comparison with Parallel KMeans with increasing delay in communication between nodes.
  • 39. 06/03/15 Comparisons with parallel KMeans in presence of imbalanced data.
  • 40. 06/03/15 Summary of Results Faster than parallel KMeans with higher delay in communication. Faster than parallel KMeans when data is highly imbalanced (50%, 20%, 20% and 10%). Faster than KMeans for a single node (sequential). Parallel KMeans is faster for 4 nodes with delay less than 100s.
  • 41. 06/03/15 Roadmap of Presentation Clustering Algorithms. - Out of Core data FEKM Evaluation of FEKM - Distributed data - Streaming data
  • 42. 06/03/15 Challenges Required online or preferably one pass algorithm. It should be able to deal with evolving nature of streaming data.
  • 43. 06/03/15 Related Work Motwani: One pass approximate algorithm (Stream) for k median clustering problem. Motwani: Algorithm for maintaining clusters in a sliding window. J. Han: A framework to deal with evolving streaming data.
  • 44. 06/03/15 Evolving Clusters in Sliding Window: Little Movement of Centers
  • 45. 06/03/15 Evolving Clusters in Sliding Window: Arbitrary Movements of Cluster Centers
  • 46. 06/03/15 Four Algorithms FESK Fast and Exact Stream KMeans FASK Fast and Approximate Stream KMeans EADSK Exact Adaptive Stream KMeans AADSK Approximate Adaptive Stream KMeans
  • 51. 06/03/15 Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK) Run KMeans on the data in first window. Estimate confidence radius, cluster radius, threshold number of points at each cluster. In New Window: Find stable, boundary and outlier points.
  • 52. 06/03/15 Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK) Keep sufficient statistics of stable points.  Keep the boundary and outlier points.
  • 53. 06/03/15 Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK) If number of outliers > the threshold  Case 1: Cluster Center Movement is substantial. Otherwise  Case 2: Movement is within conf radius.
  • 54. 06/03/15 Exact and Approximate Adaptive Stream KMeans (EADSK and AADSK) EADSK Case 1: runs KMeans on the complete data set. Case 2: runs a weighted KMeans on the saved boundary and outlier points and using the sufficient statistics of core points. AADSK Case 1: runs weighted KMeans with outlier and boundary points. Case 2: use stable and boundary points and runs weighted KMeans.
  • 57. 06/03/15 Exact and Approximate Stream KMeans (FESK and FASK) These two algorithms do not use outliers. At the initial window: estimate the confidence radius. At the next window: find stable and boundary points, and detect whether the centers move outside the confidence region. If true: FESK runs KMeans on the complete data set; FASK runs the weighted KMeans algorithm on stable and boundary points. If false: both FESK and FASK have computed the same exact centers as KMeans.
  • 58. 06/03/15 Experiments and Evaluations Experiments with Evolving Clusters On very large data sets. On smaller data at each window.
  • 59. 06/03/15 Experimental Setup and Datasets Machines: 700 MHz Pentium processors, 1 GB memory. Synthetic datasets: similar to the ones used by Bradley et al.; 6 datasets of 2.2 GB and 2 small datasets; 5, 20, and 100 dimensions; 5 and 20 clusters. Real datasets (UCI ML archive): KDDCup99 (38 dim, 1.8 GB, k=5), Corel image (32 dim, 1.9 GB, k=16). Super-sampling, normalized [0,1].
  • 60. 06/03/15 Comparison of KMeans, FESK and FASK for Very Large Data Set (2.2GB)
  • 61. 06/03/15 Comparison of KMeans with EADSK and AADSK with Very Large Data Set (2.2GB)
  • 62. 06/03/15 Comparison of KMeans, FESK and FASK on the c5d20 dataset with a small number of points in a window and centers moving within the confidence region
  • 63. 06/03/15 Comparison with EADSK and AADSK with the c5d20 data set
  • 64. 06/03/15 Centers are moving outside the confidence region in every window
  • 65. 06/03/15 Centers are moving outside confidence region in every window
  • 66. 06/03/15 Characteristics of the Algorithms All four algorithms perform better than multi-pass KMeans when the centers do not move outside the confidence region. FESK and EADSK always find the same exact centers as KMeans. FESK and EADSK perform no better than KMeans if the centers move outside the confidence region for any window. FASK and AADSK run much faster than KMeans in this case. AADSK is able to compute a better result than FASK for evolving streaming data.
  • 67. 06/03/15 Conclusion Efficient variant of KMeans algorithm for out of core, distributed and streaming data. Evolving nature of streaming data is considered. Streaming data clustering is integrated with change detection.
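
The boundary-point test from slide 14 (|d1-d2| < δ1+δ2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `euclidean` and `classify_point`, and the choice to pass the sample centers and their confidence radii as parallel lists, are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify_point(point, centers, radii):
    """Return (index of the closest sample center, is_boundary).

    A point is treated as a boundary point when some other center is
    close enough that |d1 - d2| < delta1 + delta2, where d1 is the distance
    to the closest center and delta1, delta2 are the confidence radii.
    """
    dists = [euclidean(point, c) for c in centers]
    closest = min(range(len(centers)), key=dists.__getitem__)
    for j in range(len(centers)):
        if j == closest:
            continue
        if abs(dists[closest] - dists[j]) < radii[closest] + radii[j]:
            return closest, True   # assignment could change once centers shift
    return closest, False          # stable point: assignment cannot change


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    radii = [0.5, 0.5]
    print(classify_point((1.0, 0.0), centers, radii))
    print(classify_point((5.2, 0.0), centers, radii))
```

Points far from the midpoint between two centers come out stable, so only their sufficient statistics need to be kept; points near the midpoint are retained individually.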
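
The recomputation step described on slides 19 and 20 (combine the sufficient statistics of stable points with the retained boundary points) might look roughly like the sketch below. It shows a single recomputation only, and everything about its shape is an assumption for illustration: the function name `recompute_centers`, the `(count, linear_sum)` layout of the statistics, and the simplification that boundary points are reassigned to the nearest current center.

```python
def recompute_centers(stable_stats, boundary_points, centers):
    """One center update from stable-point statistics plus boundary points.

    stable_stats[j] is (count, linear_sum) for the stable points of cluster j.
    Boundary points are reassigned individually to their nearest center.
    """
    k = len(centers)
    dim = len(centers[0])
    counts = [stable_stats[j][0] for j in range(k)]
    sums = [list(stable_stats[j][1]) for j in range(k)]

    for p in boundary_points:
        # Nearest current center by squared Euclidean distance.
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]

    new_centers = []
    for j in range(k):
        if counts[j] == 0:
            new_centers.append(list(centers[j]))  # empty cluster: keep old center
        else:
            new_centers.append([s / counts[j] for s in sums[j]])
    return new_centers


if __name__ == "__main__":
    stats = [(2, [2.0, 0.0]), (2, [22.0, 0.0])]
    print(recompute_centers(stats, [(4.0, 0.0)], [(1.0, 0.0), (11.0, 0.0)]))
```

Because stable points contribute only through their counts and sums, the update does not need another scan of the full dataset.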
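
One pass of the parallel KMeans baseline on slide 33 can be simulated in a single process as follows. This is a hedged sketch rather than the MPI implementation cited in the talk: `local_stats` and `merge_and_recompute` are hypothetical names, each "node" is just a Python list of points, and communication is replaced by passing return values.

```python
def local_stats(points, centers):
    """Per-node scan: count, linear sum and square sum for each cluster."""
    k = len(centers)
    dim = len(centers[0])
    counts = [0] * k
    linear = [[0.0] * dim for _ in range(k)]
    square = [0.0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for d in range(dim):
            linear[j][d] += p[d]
        square[j] += sum(a * a for a in p)  # kept for error computation
    return counts, linear, square


def merge_and_recompute(all_stats, old_centers):
    """Central-node step: merge per-node statistics and recompute centers."""
    k = len(old_centers)
    dim = len(old_centers[0])
    new_centers = []
    for j in range(k):
        n = sum(stats[0][j] for stats in all_stats)
        if n == 0:
            new_centers.append(list(old_centers[j]))  # no points assigned
            continue
        new_centers.append(
            [sum(stats[1][j][d] for stats in all_stats) / n for d in range(dim)]
        )
    return new_centers


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    node_a = [(0.0, 0.0), (2.0, 0.0)]
    node_b = [(10.0, 0.0), (12.0, 0.0), (1.0, 0.0)]
    stats = [local_stats(node_a, centers), local_stats(node_b, centers)]
    print(merge_and_recompute(stats, centers))
```

Each pass requires one round of communication, which is why the slides report that this scheme degrades on loosely coupled nodes: the number of rounds equals the number of KMeans iterations.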