Parallel K-means Clustering in Erlang
Guide: Dr. Govindarajulu R.
Chinmay Patel - 201405627
Dharak Kharod - 201405583
Pawan Kumar - 201405637
The Clustering Problem
➢ Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in
some sense or another) to each other than to those in other groups (clusters).
➢ Unsupervised learning
➢ NP-hard problem
➢ Organizing data into clusters such that there is:
○ High intra-cluster similarity
○ Low inter-cluster similarity
○ Informally, finding natural groupings among objects
K-means Clustering Algorithm
➢ In k-means clustering, a set of n data points in d-dimensional space R^d and an
integer k are given, and the problem is to determine a set of k points in R^d,
called centers (means), so as to minimize the mean squared distance from
each data point to its nearest center.
➢ A fast, robust, easy-to-understand, partitional (non-hierarchical) clustering
method.
➢ The k-means algorithm does not necessarily find the optimal configuration
corresponding to the global minimum of the objective function. The
algorithm is also significantly sensitive to the initial, randomly selected cluster
centres.
Erlang Language
➢ Lightweight Concurrency
➢ Built-in fault tolerance and asynchronous message passing
➢ No shared state
➢ Pattern matching
➢ Used at: Facebook chat service backend, Amazon SimpleDB
➢ Our experiments: Leader Election Algorithm, Chat server
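A minimal sketch of the message-passing style these features enable (a hypothetical ping module, not taken from the project code):

    -module(ping).
    -export([start/0, worker/0]).

    %% Spawn a lightweight worker process and exchange one asynchronous message.
    start() ->
        Pid = spawn(?MODULE, worker, []),
        Pid ! {self(), hello},
        receive
            {Pid, Reply} -> Reply
        after 1000 ->
            timeout
        end.

    %% The worker pattern-matches on the incoming message and replies to the sender.
    worker() ->
        receive
            {From, hello} -> From ! {self(), world}
        end.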
Standard Algorithm
➢ Given an initial set of k means m_1^(1), …, m_k^(1), the algorithm proceeds by
alternating between two steps:
➢ Assignment step: Assign each observation to the cluster whose mean yields
the least within-cluster sum of squares (WCSS). Since the sum of squares is
the squared Euclidean distance, this is intuitively the "nearest" mean.
Each x_p is assigned to exactly one S_i^(t).
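The assignment rule itself appeared as an image on the original slide; a standard reconstruction in LaTeX:

    S_i^{(t)} = \{\, x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \forall\, j,\ 1 \le j \le k \,\}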
Standard Algorithm
➢ Update step: Calculate the new means to be the centroids of the observations
in the new clusters (the update rule is reconstructed below).
➢ Since the arithmetic mean is a least-squares estimator, this also minimizes
the within-cluster sum of squares (WCSS) objective.
➢ The algorithm has converged when the assignments no longer change. Since
both steps optimize the WCSS objective, and there are only finitely many such
partitionings, the algorithm must converge to a (local) optimum.
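The update rule appeared as an image on the original slide; the standard centroid formula in LaTeX:

    m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j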
Naive Parallel K-means Clustering Algorithm
➢ In the parallel version of the k-means algorithm, the work of calculating means
and grouping data into clusters is divided among several nodes.
➢ Suppose there are N worker nodes; we divide our data set into N approximately
equal parts and send each part to one worker node. The server node sends the
initial set of K means to each worker node.
➢ Each worker node divides its own sublist into K clusters according to the K
means sent from the server node.
Naive Parallel K-means Clustering Algorithm
➢ After computing its K sub-clusters, each worker node, instead of sending the
whole sub-clusters, sends the sum of the points and the count of points in each
of its K sub-clusters. The server node then calculates the actual mean of each
cluster across all workers.
➢ This gives a new set of means, which is again sent to each worker node, and the
process repeats until the means no longer change.
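A minimal sketch of one worker round under this scheme, in Erlang. The helper functions nearest/2 (index of the closest mean), vec_add/2 (element-wise sum of two points) and zeros/1 (zero vector of the means' dimension) are assumed for illustration; this is not the project's actual code.

    %% One round of the naive parallel scheme, as seen from a worker process.
    worker(Points, Master) ->
        receive
            {means, Means} ->
                Indices = lists:seq(1, length(Means)),
                Empty = maps:from_list([{I, {zeros(Means), 0}} || I <- Indices]),
                %% Accumulate per-cluster {SumOfPoints, Count} instead of raw sub-clusters.
                Partial = lists:foldl(
                            fun(P, Acc) ->
                                    I = nearest(P, Means),
                                    {Sum, Count} = maps:get(I, Acc),
                                    Acc#{I := {vec_add(Sum, P), Count + 1}}
                            end,
                            Empty, Points),
                Master ! {partial, self(), Partial},
                worker(Points, Master);
            stop ->
                ok
        end.

The master then adds up the per-cluster {Sum, Count} pairs from all workers, divides each total sum by its total count to obtain the new means, and broadcasts them again until they stop changing.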
Improved Parallel K-means Clustering Algorithm
➢ The standard algorithm calculates the distance from each data object to every cluster
mean in each iteration. However, it is not necessary to recalculate all of these
distances every time.
➢ The main idea of the algorithm is to keep two data structures that retain, for every
data object, its cluster label and its distance to the nearest cluster mean during
each iteration.
➢ These can be used in the next iteration: we calculate the distance between the current
data object and the new mean of its previous cluster; if the computed distance is
smaller than or equal to the stored distance to the old mean, the data object stays in
the cluster it was assigned to in the previous iteration. (A sketch of this check
follows below.)
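A sketch of this per-point check in Erlang, assuming hypothetical helpers dist/2 (Euclidean distance) and nearest_with_dist/2 (which returns {Index, Distance} of the closest mean); when the shortcut fails, the point is simply compared against all means as in the standard step.

    %% Each point carries its previous cluster label and its previous distance.
    assign_point({Point, OldLabel, OldDist}, NewMeans) ->
        D = dist(Point, lists:nth(OldLabel, NewMeans)),
        if
            D =< OldDist ->
                %% Still at least as close to its old cluster's new mean: keep it.
                {Point, OldLabel, D};
            true ->
                %% Otherwise fall back to a full scan over all means.
                {Label, Dist} = nearest_with_dist(Point, NewMeans),
                {Point, Label, Dist}
        end.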
Introduction to KD Tree
➢ In each iteration, the algorithm boils down to calculating the nearest centroid for
every data point.
➢ If we take the geometric arrangement of the data into consideration, we can
reduce the number of comparisons. This can be done using a KD tree.
➢ A KD tree is used to store spatial data and to answer nearest-neighbour queries.
➢ A KD tree is a binary tree where each node is associated with a disjoint
subset of the data.
➢ KD trees are guaranteed log_2 n depth, where n is the number of points in the
set.
KD Tree Construction
➢ If there is just one point, form a leaf with that point.
➢ Otherwise, cycle through the data dimensions to select the splitting plane.
➢ Split at the median (to get a balanced tree).
➢ Continue recursively until there are no more points to split on either side of the
splitting plane.
➢ Each non-leaf node represents a hierarchical subdivision of the data into two
half-spaces using a hyperplane (the splitting plane). The hyperplane is orthogonal
to one of the coordinate axes.
➢ Constructing the k-d tree can be done in O(dn log n) time with O(n) storage, where
d is the dimension of the data.
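A sketch of this construction in Erlang, with points represented as same-size tuples. For brevity it sorts at every level (O(n log² n) overall) instead of using linear-time median selection, so it is illustrative rather than achieving the quoted O(dn log n) bound.

    %% Build a balanced k-d tree: leaves hold single points; internal nodes hold
    %% the splitting axis and splitting value. The axis cycles with the depth.
    build([P], _Depth) ->
        {leaf, P};
    build(Points, Depth) ->
        K = tuple_size(hd(Points)),
        Axis = (Depth rem K) + 1,
        Sorted = lists:sort(fun(A, B) -> element(Axis, A) =< element(Axis, B) end,
                            Points),
        {Left, Right} = lists:split(length(Sorted) div 2, Sorted),
        SplitValue = element(Axis, hd(Right)),
        {node, Axis, SplitValue,
         build(Left, Depth + 1),
         build(Right, Depth + 1)}.

For example, build([{2,3},{5,4},{9,6},{4,7},{8,1},{7,2}], 0) builds the tree for six 2-D points, alternating between the x and y axes.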
Example for 2D
Nearest Neighbour in KD Tree
➢ Query: given a kd-tree and a point in space, which point in the kd-tree is
closest to the test point?
➢ Given our guess of what the nearest neighbour is, we can observe that if there is
a point in this data set that is closer to the test point than our current guess, it
must lie in the circle centered at the test point that passes through the current
guess (see the figure on the next slide).
➢ This lets us prune which parts of the tree might hold the true nearest neighbour.
➢ The runtime depends on the data distribution, but the search has been shown to
run in O(log n) average time per query in a reasonable model (assuming d is
constant).
Example for 2D
Nearest Neighbour in KD Tree
NNS(q: point, n: node, p: point, w: distance) {    // initial call: NNS(q, root, p, infinity)
    if n.left = null then                          // leaf case
        if distance(q, n.point) < w then return n.point else return p;
    else
        if w = infinity then
            if q(n.axis) < n.value then {
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
                if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
            } else {
                p := NNS(q, n.right, p, w);
                w := distance(p, q);
                if q(n.axis) - w < n.value then p := NNS(q, n.left, p, w);
            }
        else {                                     // w is finite
            if q(n.axis) - w < n.value then {
                p := NNS(q, n.left, p, w);
                w := distance(p, q);
            }
            if q(n.axis) + w > n.value then p := NNS(q, n.right, p, w);
        }
    return p
}
Parallel NN-based K-means Clustering Algorithm
➢ The master divides the data points randomly and sends them to the worker nodes
along with the initial centroids.
➢ At each worker node:
○ Create a KD tree out of the received centroids.
○ For each data point, find the nearest centroid using the NNS algorithm and assign it to
the corresponding cluster.
○ Send the per-cluster results to the master.
➢ The master receives the results for each cluster and calculates the new means.
➢ Repeat until convergence.
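A worker-side sketch of one round, in Erlang, assuming build/2 from the construction sketch above and a hypothetical nns/2 wrapper around the NNS pseudocode that returns the nearest centroid. The raw assignments are returned here for brevity; in practice the worker would aggregate per-cluster sums and counts as in the naive scheme.

    %% The (small) centroid tree is rebuilt each round, then every point is classified.
    nn_worker(Points, Master) ->
        receive
            {means, Means} ->
                Tree = build(Means, 0),
                Assignments = [{P, nns(P, Tree)} || P <- Points],
                Master ! {assignments, self(), Assignments},
                nn_worker(Points, Master);
            stop ->
                ok
        end.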
Drawbacks of the NNS-based Algorithm
➢ We reduce the nearest-neighbour search from k comparisons to about log(k), so if the
k value is small there is not much difference. Usually that is the case, so the
speed-up we get is low.
➢ Because we build the KD tree over the centroids and then perform nearest-neighbour
search, the KD tree has to be rebuilt in every iteration.
➢ A method based on building the KD tree over the data points instead could save a
lot of time. This is the basis of the filtering algorithm.
The Filtering Algorithm
➢ In the kd-tree, the bounding box B of a node refers to the smallest box (an
axis-aligned hyper-rectangle) that contains all of that node's points.
➢ Each node in the kd-tree has an associated set of potential candidate centers.
All k centers are candidate centers for the root.
➢ For each node, the candidate center z0 that is closest to the midpoint of B is
computed first and kept in the set Z; then we filter out each impossible
candidate center z.
➢ That is, if no part of B is closer to a candidate center z than to z0, z is not
kept in that node's candidate center set Z.
The Filtering Algorithm
➢ To compare the distance from any part of B to a
candidate z, we compare distances at the vertices of B,
since a vertex maximizes the distance over all points
in B from the given z. (A sketch of this test follows
below.)
➢ If more than one candidate center remains at a
leaf node, the usual distance computation has to be
done there.
➢ Pre-processed data (each subtree's weighted centroid and point count) is stored
in the kd-tree nodes.
➢ It is a data-sensitive algorithm.
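One common way to implement the z.isFarther(z*, C) test used in the pseudocode below, sketched in Erlang: pick the vertex of the cell that is extreme in the direction from z* towards z and compare the two distances there. Cells are assumed to be given as {Min, Max} corner tuples, and sq_dist/2 is a hypothetical squared-distance helper.

    %% True if no part of cell C = {Min, Max} is closer to Z than to ZStar,
    %% in which case Z can be pruned from the candidate set.
    is_farther(Z, ZStar, {Min, Max}) ->
        Dims = lists:seq(1, tuple_size(Z)),
        %% The cell vertex that is extreme in the direction ZStar -> Z.
        V = list_to_tuple([case element(I, Z) >= element(I, ZStar) of
                               true  -> element(I, Max);
                               false -> element(I, Min)
                           end || I <- Dims]),
        sq_dist(Z, V) >= sq_dist(ZStar, V).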
The Filtering Algorithm
Filter(kdNode u, CandidateSet Z) {
    C <- u.cell;
    If ( u is a leaf ) {
        z* <- the closest point in Z to u.point;
        z*.wgtCent <- z*.wgtCent + u.point;
        z*.count <- z*.count + 1;
    }
    Else {
        z* <- the closest point in Z to C’s midpoint;
        For each ( z belongs to Z \ {z*} )
            If ( z.isFarther(z*, C) ) Z <- Z \ {z};
        If ( |Z| = 1 ) {
            z*.wgtCent <- z*.wgtCent + u.wgtCent;
            z*.count <- z*.count + u.count;
        }
        Else {
            Filter(u.left, Z);
            Filter(u.right, Z);
        }
    }
}
Parallel Filtering-based K-means Algorithm
➢ The master builds a kd-tree of height log n, where n is the number of worker nodes,
and sends the subtrees at its leaves to the worker nodes along with the initial
centroids.
➢ At each worker node:
○ Create a KD tree out of the received data points (done only once).
○ Run the filtering algorithm.
○ Send the per-cluster results to the master.
➢ The master receives the results for each cluster and calculates the new means.
➢ Repeat until convergence.
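A worker-side sketch of this loop in Erlang: the local data tree is built once, and only the candidate centre set changes between rounds. filter/2 is assumed to return the per-centre weighted sums and counts accumulated by the Filter pseudocode above.

    %% The expensive structure (the k-d tree over local data) survives across rounds.
    filter_worker(Tree, Master) ->
        receive
            {centres, Centres} ->
                Master ! {partial, self(), filter(Tree, Centres)},
                filter_worker(Tree, Master);
            stop ->
                ok
        end.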
Results and Conclusion
➢ The clustering problem is NP-hard. Approximations such as Lloyd’s algorithm are
computationally much more attractive and converge to a local optimum.
➢ Compared with the sequential version, the parallel algorithm improves performance
by far.
➢ On datasets with relatively few data points but a higher number of clusters, the
nearest-neighbour algorithm performs much better.
➢ If the data is well separated, the filtering algorithm gives the best overall
performance. Otherwise its performance is as good as that of the normal parallel
k-means.
Thank You.