SlideShare a Scribd company logo
Determining the k in k-means
with MapReduce
Thibault Debatty, Pietro Michiardi,
Wim Mees & Olivier Thonnard
Algorithms for MapReduce and Beyond 2014
Determining the k in k-means with MapReduce 2
Clustering & k-means
● Clustering
● K-means
[Stuart P. Lloyd. Least squares quantization in pcm. IEEE
Transactions on Information Theory, 28:129–137, 1982.]
– 1982 (a great year!)
– But still largely used
– Drawbacks (amongst others):
● Local minimum
● K is a parameter!
Determining the k in k-means with MapReduce 3
Clustering & k-means
● Determine k:
– VERY difficult
[Anil K Jain. Data Clustering : 50 Years Beyond K-Means.
Pattern Recognition Letters, 2009]
– Using cluster evaluation metrics:
Dunn's index, Elbow, Silhouette, “jump method” (based on
information theory), “Gap statistic”,...
O(k²)
Determining the k in k-means with MapReduce 4
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-
means. In Neural Information Processing Systems. MIT
Press, 2003]
● K-means : points in each cluster are
spherically distributed around the center
Source:scikit-learn
Determining the k in k-means with MapReduce 5
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-
means. In Neural Information Processing Systems. MIT
Press, 2003]
● K-means : points in each cluster are
spherically distributed around the center
normality test &
recursion
Determining the k in k-means with MapReduce 6
G-means
Dataset
Determining the k in k-means with MapReduce 7
G-means
1. Pick 2 centers
Determining the k in k-means with MapReduce 8
G-means
2. k-means
Determining the k in k-means with MapReduce 9
G-means
3. Project
Determining the k in k-means with MapReduce 10
G-means
3. Project
Determining the k in k-means with MapReduce 11
G-means
Normal?
No
=> recursion
4. Normality test
Determining the k in k-means with MapReduce 12
G-means
5. Recursion
Determining the k in k-means with MapReduce 13
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
Determining the k in k-means with MapReduce 14
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
Determining the k in k-means with MapReduce 15
MapReduce G-means
2. Reduce number of jobs
PickInitialCenters
while Not ClusteringCompleted do
KMeans
KMeansAndFindNewCenters
TestClusters
end while
Determining the k in k-means with MapReduce 16
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
3. Maximize
parallelism
4. Limit memory
usage
Determining the k in k-means with MapReduce 17
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
Bottleneck
3. Maximize
parallelism
4. Limit memory
usage (risk of crash)
Determining the k in k-means with MapReduce 18
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
In memory combiner
Determining the k in k-means with MapReduce 19
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
#clusters > #reducers
&
Estimated required memory < Java heap
Determining the k in k-means with MapReduce 20
MapReduce G-means
TestClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Emit(cluster, projection)
end procedure
Reduce(cluster, projections)
Build a vector
ADtest(vector)
if normal then
Mark cluster
end if
end procedure
TestFewClusters
Map(key, point)
Find cluster
Find vector
Project point on vector
Add projection to list
end procedure
Close()
For each list do
Build a vector
A2 = ADtest(vector)
Emit(cluster, A2)
End for each
end procedure
#clusters > #reducers
&
Estimated required memory < Java heap
Experimentally:
64 Bytes / point
Determining the k in k-means with MapReduce 21
Comparison
MR multi-k-means MR G-means
Speed
Quality
all possible values of k
in a single job
Determining the k in k-means with MapReduce 22
Comparison
MR multi-k-means MR G-means
Speed O(nk²) computations O(nk) computations
But:
● more iterations
● more dataset reads
● log2
(k)
Quality New centers added if and
where needed
But:
tends to overestimate k!
Determining the k in k-means with MapReduce 23
Experimental results : Speed
● Hadoop
● Synthetic dataset
● 10M points in R10
● Euclidean distance
● 8 machines
Determining the k in k-means with MapReduce 24
Experimental results : Quality
● Hadoop
● Synthetic dataset
● 10M points in R10
● Euclidean distance
● 8 machines
k 100 200 400
kfound
150 279 639
Within Cluster Sum of Square
(less is better)
MR G-means 3.34 3.33 3.23
multi-k-means 3.71 3.6 3.39
(with same k)
x ~1.5
Determining the k in k-means with MapReduce 25
Conclusions & future work...
● MapReduce algorithm to determine k
● Running time proportional to k
● Future:
– Overestimation of k
– Test on real data
– Test scalability
– Reduce I/O (using Spark)
– Consider skewed data
– Consider impact of machine failure
Determining the k in k-means with MapReduce 26
Thank you!

More Related Content

What's hot

3D Watershed Celebes
3D Watershed Celebes3D Watershed Celebes
3D Watershed Celebes
Hartanto Sanjaya
 
3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network
Hartanto Sanjaya
 
3D Analyst - Cut and Fill
3D Analyst - Cut and Fill3D Analyst - Cut and Fill
3D Analyst - Cut and Fill
Hartanto Sanjaya
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field TheoryIndraneel Pole
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM
Hartanto Sanjaya
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
Haripritha
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
Adrian Florea
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
kazuma_sato
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Alexander Schätzle
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentationPavel Popa
 
3D Analyst - Watershed
3D Analyst - Watershed3D Analyst - Watershed
3D Analyst - Watershed
Hartanto Sanjaya
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...GISRUK conference
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang
Hartanto Sanjaya
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon
Hartanto Sanjaya
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
Uday Vakalapudi
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
Sneh Pahilwani
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
Subhas Kumar Ghosh
 
On Skyline Groups
On Skyline GroupsOn Skyline Groups
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed Lombok
Hartanto Sanjaya
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFish
Anushree Prasanna Kumar
 

What's hot (20)

3D Watershed Celebes
3D Watershed Celebes3D Watershed Celebes
3D Watershed Celebes
 
3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network3D Analyst - Watershed and Stream Network
3D Analyst - Watershed and Stream Network
 
3D Analyst - Cut and Fill
3D Analyst - Cut and Fill3D Analyst - Cut and Fill
3D Analyst - Cut and Fill
 
Presentation for Numerical Field Theory
Presentation for Numerical Field TheoryPresentation for Numerical Field Theory
Presentation for Numerical Field Theory
 
3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM3D Analyst - Watershed from SRTM
3D Analyst - Watershed from SRTM
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 
MapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large ClustersMapReduce: Simplified Data Processing On Large Clusters
MapReduce: Simplified Data Processing On Large Clusters
 
Map-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP ProcessingMap-Side Merge Joins for Scalable SPARQL BGP Processing
Map-Side Merge Joins for Scalable SPARQL BGP Processing
 
BDC-presentation
BDC-presentationBDC-presentation
BDC-presentation
 
3D Analyst - Watershed
3D Analyst - Watershed3D Analyst - Watershed
3D Analyst - Watershed
 
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
4A_ 3_Parallel k-means clustering using gp_us for the geocomputation of real-...
 
3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang3D Analyst - Watershed, Padang
3D Analyst - Watershed, Padang
 
3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon3D Analyst - Watershed, Tomohon
3D Analyst - Watershed, Tomohon
 
Mapreduce total order sorting technique
Mapreduce total order sorting techniqueMapreduce total order sorting technique
Mapreduce total order sorting technique
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
On Skyline Groups
On Skyline GroupsOn Skyline Groups
On Skyline Groups
 
3D Analyst Watershed Lombok
3D Analyst Watershed  Lombok3D Analyst Watershed  Lombok
3D Analyst Watershed Lombok
 
Optimization of graph storage using GoFFish
Optimization of graph storage using GoFFishOptimization of graph storage using GoFFish
Optimization of graph storage using GoFFish
 

Similar to Determining the k in k-means with MapReduce

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
Gabriela Agustini
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Austin Benson
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
Kyong-Ha Lee
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
Md Abul Hayat
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
graphulo
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
MIT
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
Vivian S. Zhang
 
Dp idp exploredb
Dp idp exploredbDp idp exploredb
Dp idp exploredb
George Valkanas
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
Ajay Bidyarthy
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
David Gleich
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
David Gleich
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
prithan
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
AMR koura
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
Eunice Lin
 
Mypreson 27
Mypreson 27Mypreson 27
Mypreson 27
Venkatesh Nandigama
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
Darshak Mehta
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
Vissarion Fisikopoulos
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
David Gleich
 
LSH
LSHLSH
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Austin Benson
 

Similar to Determining the k in k-means with MapReduce (20)

MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
Tall-and-skinny Matrix Computations in MapReduce (ICME colloquium)
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Fa18_P2.pptx
Fa18_P2.pptxFa18_P2.pptx
Fa18_P2.pptx
 
141205 graphulo ingraphblas
141205 graphulo ingraphblas141205 graphulo ingraphblas
141205 graphulo ingraphblas
 
141222 graphulo ingraphblas
141222 graphulo ingraphblas141222 graphulo ingraphblas
141222 graphulo ingraphblas
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
Dp idp exploredb
Dp idp exploredbDp idp exploredb
Dp idp exploredb
 
Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
 
Tall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduceTall and Skinny QRs in MapReduce
Tall and Skinny QRs in MapReduce
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
 
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem 141031_Lagrange Relaxation Based Method for the QoS Routing Problem
141031_Lagrange Relaxation Based Method for the QoS Routing Problem
 
Mypreson 27
Mypreson 27Mypreson 27
Mypreson 27
 
K means clustering algorithm
K means clustering algorithmK means clustering algorithm
K means clustering algorithm
 
Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.Oracle-based algorithms for high-dimensional polytopes.
Oracle-based algorithms for high-dimensional polytopes.
 
Big data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphsBig data matrix factorizations and Overlapping community detection in graphs
Big data matrix factorizations and Overlapping community detection in graphs
 
LSH
LSHLSH
LSH
 
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
Direct QR factorizations for tall-and-skinny matrices in MapReduce architectu...
 

More from Thibault Debatty

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphs
Thibault Debatty
 
Blockchain for dummies
Blockchain for dummiesBlockchain for dummies
Blockchain for dummies
Thibault Debatty
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation Awareness
Thibault Debatty
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithms
Thibault Debatty
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...
Thibault Debatty
 
Cyber Range
Cyber RangeCyber Range
Cyber Range
Thibault Debatty
 
Easy Server Monitoring
Easy Server MonitoringEasy Server Monitoring
Easy Server Monitoring
Thibault Debatty
 
Data diode
Data diodeData diode
Data diode
Thibault Debatty
 
USB Portal
USB PortalUSB Portal
USB Portal
Thibault Debatty
 
Smart Router
Smart RouterSmart Router
Smart Router
Thibault Debatty
 
Web shell detector
Web shell detectorWeb shell detector
Web shell detector
Thibault Debatty
 
Graph based APT detection
Graph based APT detectionGraph based APT detection
Graph based APT detection
Thibault Debatty
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT Detection
Thibault Debatty
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text Data
Thibault Debatty
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopThibault Debatty
 

More from Thibault Debatty (15)

An introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphsAn introduction to similarity search and k-nn graphs
An introduction to similarity search and k-nn graphs
 
Blockchain for dummies
Blockchain for dummiesBlockchain for dummies
Blockchain for dummies
 
Building a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation AwarenessBuilding a Cyber Range for training Cyber Defense Situation Awareness
Building a Cyber Range for training Cyber Defense Situation Awareness
 
Design and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithmsDesign and analysis of distributed k-nearest neighbors graph algorithms
Design and analysis of distributed k-nearest neighbors graph algorithms
 
A comparative analysis of visualisation techniques to achieve CySA in the mi...
A comparative analysis of visualisation techniques to achieve CySA in the  mi...A comparative analysis of visualisation techniques to achieve CySA in the  mi...
A comparative analysis of visualisation techniques to achieve CySA in the mi...
 
Cyber Range
Cyber RangeCyber Range
Cyber Range
 
Easy Server Monitoring
Easy Server MonitoringEasy Server Monitoring
Easy Server Monitoring
 
Data diode
Data diodeData diode
Data diode
 
USB Portal
USB PortalUSB Portal
USB Portal
 
Smart Router
Smart RouterSmart Router
Smart Router
 
Web shell detector
Web shell detectorWeb shell detector
Web shell detector
 
Graph based APT detection
Graph based APT detectionGraph based APT detection
Graph based APT detection
 
Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT Detection
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text Data
 
Parallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with HadoopParallel SPAM Clustering with Hadoop
Parallel SPAM Clustering with Hadoop
 

Recently uploaded

Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
sanjana502982
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 

Recently uploaded (20)

Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Toxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and ArsenicToxic effects of heavy metals : Lead and Arsenic
Toxic effects of heavy metals : Lead and Arsenic
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 

Determining the k in k-means with MapReduce

  • 1. Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard Algorithms for MapReduce and Beyond 2014
  • 2. Determining the k in k-means with MapReduce 2 Clustering & k-means ● Clustering ● K-means [Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.] – 1982 (a great year!) – But still largely used – Drawbacks (amongst others): ● Local minimum ● K is a parameter!
  • 3. Determining the k in k-means with MapReduce 3 Clustering & k-means ● Determine k: – VERY difficult [Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009] – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”,... O(k²)
  • 4. Determining the k in k-means with MapReduce 4 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center Source:scikit-learn
  • 5. Determining the k in k-means with MapReduce 5 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center normality test & recursion
  • 6. Determining the k in k-means with MapReduce 6 G-means Dataset
  • 7. Determining the k in k-means with MapReduce 7 G-means 1. Pick 2 centers
  • 8. Determining the k in k-means with MapReduce 8 G-means 2. k-means
  • 9. Determining the k in k-means with MapReduce 9 G-means 3. Project
  • 10. Determining the k in k-means with MapReduce 10 G-means 3. Project
  • 11. Determining the k in k-means with MapReduce 11 G-means Normal? No => recursion 4. Normality test
  • 12. Determining the k in k-means with MapReduce 12 G-means 5. Recursion
  • 13. Determining the k in k-means with MapReduce 13 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  • 14. Determining the k in k-means with MapReduce 14 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  • 15. Determining the k in k-means with MapReduce 15 MapReduce G-means 2. Reduce number of jobs PickInitialCenters while Not ClusteringCompleted do KMeans KMeansAndFindNewCenters TestClusters end while
  • 16. Determining the k in k-means with MapReduce 16 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure 3. Maximize parallelism 4. Limit memory usage
  • 17. Determining the k in k-means with MapReduce 17 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure Bottleneck 3. Maximize parallelism 4. Limit memory usage (risk of crash)
  • 18. Determining the k in k-means with MapReduce 18 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure In memory combiner
  • 19. Determining the k in k-means with MapReduce 19 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap
  • 20. Determining the k in k-means with MapReduce 20 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap Experimentally: 64 Bytes / point
  • 21. Determining the k in k-means with MapReduce 21 Comparison MR multi-k-means MR G-means Speed Quality all possible values of k in a single job
  • 22. Determining the k in k-means with MapReduce 22 Comparison MR multi-k-means MR G-means Speed O(nk²) computations O(nk) computations But: ● more iterations ● more dataset reads ● log2 (k) Quality New centers added if and where needed But: tends to overestimate k!
  • 23. Determining the k in k-means with MapReduce 23 Experimental results : Speed ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines
  • 24. Determining the k in k-means with MapReduce 24 Experimental results : Quality ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines k 100 200 400 kfound 150 279 639 Within Cluster Sum of Square (less is better) MR G-means 3.34 3.33 3.23 multi-k-means 3.71 3.6 3.39 (with same k) x ~1.5
  • 25. Determining the k in k-means with MapReduce 25 Conclusions & future work... ● MapReduce algorithm to determine k ● Running time proportional to k ● Future: – Overestimation of k – Test on real data – Test scalability – Reduce I/O (using Spark) – Consider skewed data – Consider impact of machine failure
  • 26. Determining the k in k-means with MapReduce 26 Thank you!