The document discusses enhancing K-Means clustering performance by converting the algorithm to a concurrent version using multi-threading. It identifies that steps 2 and 3 of the basic K-Means algorithm contain independent sub-tasks that can be executed in parallel. The implementation in C# uses the Parallel class to parallelize the processing. Analysis shows the concurrent version runs roughly 45–87% faster, with performance gains increasing at higher numbers of clusters and data points. Future work could parallelize the full K-Means algorithm.
4. Definition
• K-Means clustering is a method of cluster
analysis which aims to partition n
observations into k clusters in which each
observation belongs to the cluster with the
nearest mean.
5. Definition
• The K-Means problem is to find cluster centers
that minimize the sum of squared distances
from each data point being clustered to its
cluster center (the center that is closest to it).
• A very common measure is the sum of
distances or sum of squared Euclidean
distances from the mean of each cluster.
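In standard notation (not taken from the slides themselves), the objective described above can be written as:

```latex
\min_{C_1,\dots,C_k} \; J \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```

where $C_i$ is the set of observations assigned to cluster $i$ and $\mu_i$ is the mean of that cluster.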
7. Step 1
The algorithm arbitrarily selects k
points as the initial cluster centers
(“means”).
8. Step 2
Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster center.
9. Step 3
Each cluster center is recomputed as
the average of the points in that
cluster.
Steps 2 and 3 repeat until the clusters converge.
10. Convergence
Convergence means that either no observations change clusters when steps 2 and 3 are repeated, or that the changes do not make a material difference in the definition of the clusters.
12. K-Means Algorithm Deficiencies
• The k-means algorithm has at least two major theoretical shortcomings:
– It has been shown that the worst-case running time of the algorithm is super-polynomial in the input size.
– The approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.
13. Our Work
Basic K-Means will be converted into a Concurrent K-Means version that uses .NET Framework libraries to take advantage of multi-threading technology.
14. Our Work
This concurrent version of K-Means preserves all the benefits of basic K-Means while running roughly 70%~85% faster than basic K-Means.
16. First Step
First we must identify the tasks containing independent sub-tasks that can be executed in parallel.
17. Identifying sub-Tasks
Consider the K-Means algorithm as follows:
1) Pick Random Center Points
2) Assign Points To Centers
3) Calculate New Centers
4) Check If Centers Are Equal
(if so, quit Else Go to 2)
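As a point of reference, the four steps above can be sketched in serial C# roughly as follows (the `Cluster` and `NearestCenter` names and the `double[]` point representation are illustrative assumptions, not the original code):

```csharp
using System;
using System.Linq;

static class BasicKMeans
{
    // Squared Euclidean distance between two points.
    static double Dist2(double[] a, double[] b) =>
        a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

    // Index of the center closest to the given point.
    static int NearestCenter(double[] point, double[][] centers) =>
        Enumerable.Range(0, centers.Length)
                  .OrderBy(i => Dist2(point, centers[i]))
                  .First();

    // Returns, for each point, the index of the cluster it ends up in.
    public static int[] Cluster(double[][] points, int k, int maxIter = 100)
    {
        var rng = new Random(0);
        // Step 1: pick k random points as the initial centers.
        var centers = points.OrderBy(_ => rng.Next()).Take(k)
                            .Select(p => (double[])p.Clone()).ToArray();
        var assignment = new int[points.Length];

        for (int iter = 0; iter < maxIter; iter++)
        {
            // Step 2: assign each point to its closest center.
            for (int p = 0; p < points.Length; p++)
                assignment[p] = NearestCenter(points[p], centers);

            // Step 3: recompute each center as the mean of its points.
            var newCenters = Enumerable.Range(0, k).Select(c =>
            {
                var members = points.Where((_, p) => assignment[p] == c).ToArray();
                return members.Length == 0
                    ? centers[c]  // keep an empty cluster's old center
                    : Enumerable.Range(0, centers[c].Length)
                          .Select(d => members.Average(m => m[d])).ToArray();
            }).ToArray();

            // Step 4: stop when the centers no longer move.
            bool converged = centers.Zip(newCenters, (a, b) => a.SequenceEqual(b))
                                    .All(same => same);
            centers = newCenters;
            if (converged) break;
        }
        return assignment;
    }
}
```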
19. Identifying sub-Tasks
In step 2, we loop over every point and determine which center is closest to it. Since no state is modified during this lookup, we can easily make this process parallel.
20. Identifying sub-Tasks
In step 3, when we calculate new centers, we simply loop over all of the points in a given group and calculate their “average” location (the centroid).
21. Identifying sub-Tasks
Steps 2 and 3 are the best candidates for parallelism because they are composed of independent loops executed over the data points.
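For step 3 in particular, each cluster's centroid can be computed without touching any other cluster's data, so the loop over the k clusters parallelizes directly with `Parallel.For`. A minimal sketch, assuming points are `double[]` arrays and an `assignment` array maps each point to its cluster (the names are hypothetical):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ParallelCentroids
{
    // Step 3 in parallel: each iteration reads only cluster c's points
    // and writes only centers[c], so there is no shared mutable state.
    public static double[][] Recompute(double[][] points, int[] assignment,
                                       int k, int dims)
    {
        var centers = new double[k][];
        Parallel.For(0, k, c =>
        {
            var members = points.Where((_, i) => assignment[i] == c).ToArray();
            centers[c] = members.Length == 0
                ? new double[dims]  // arbitrary choice for an empty cluster
                : Enumerable.Range(0, dims)
                      .Select(d => members.Average(m => m[d])).ToArray();
        });
        return centers;
    }
}
```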
24. Basic K-Means algorithm
For step 2, all we need to do is loop through each point and check every center until we find the closest one.
If we weren’t concerned with writing a parallel application, we could simply loop over them with a normal foreach statement:
foreach (var point in points) { // content goes here }
25. Concurrent K-Means algorithm
But if we leverage the System.Threading.Tasks.Parallel class in .NET 4.0, we can simply write this:
Parallel.ForEach(points, point => { // content goes here });
The same change is applied to the loop in step 3.
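Putting that snippet into context, a runnable sketch of the parallel assignment step (`Parallel.ForEach` is the actual .NET API; the point and center representation here is an assumption):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ParallelAssign
{
    // Step 2 in parallel: each point's nearest center is computed
    // independently, and each iteration writes only assignment[i].
    public static int[] Assign(double[][] points, double[][] centers)
    {
        var assignment = new int[points.Length];
        Parallel.ForEach(Enumerable.Range(0, points.Length), i =>
        {
            double best = double.MaxValue;
            for (int c = 0; c < centers.Length; c++)
            {
                // Squared Euclidean distance from point i to center c.
                double d = points[i].Zip(centers[c], (a, b) => (a - b) * (a - b))
                                    .Sum();
                if (d < best) { best = d; assignment[i] = c; }
            }
        });
        return assignment;
    }
}
```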
30. Used machine
• Experiments were made on the following machine:
• CPU = Intel(R) Xeon(R) X5690 @ 3.47 GHz / 63.9 GB of RAM
• Operating System = Microsoft Windows Server 2003 Enterprise x64 Edition Service Pack 2
• Number of Processors = 24
• Application Type = 64-bit
39. Results Analysis
• In the case of K = 10, the results show that with 5,000 data points the algorithm is enhanced by 44.98%, and this value grows to 62.86% with 50,000 data points.
40. Results Analysis
• In the case of K = 20, the results show that with 5,000 data points the algorithm is enhanced by 70.82%, and this value grows to 76.97% with 50,000 data points.
41. Results Analysis
• In the case of K = 30, the results show that with 5,000 data points the algorithm is enhanced by 77.95%, and this value grows to 82.58% with 50,000 data points.
42. Results Analysis
• In the case of K = 40, the results show that with 5,000 data points the algorithm is enhanced by 82.97%, and this value grows to 87.01% with 50,000 data points.
43. Results Analysis
• If we regress the enhancement (Y) on the number of means (X1) and the number of data points (X2), we obtain the equation below:
• Enhancement = 49.762 + 0.871 × (number of means) + 0.0001 × (number of data points)
• R² = 0.8325
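As a sanity check on the fitted equation, it can be evaluated directly (same coefficients as above, wrapped in a hypothetical helper):

```csharp
using System;

static class RegressionCheck
{
    // Fitted model from the results: predicted enhancement (%) as a
    // function of the number of means (K) and the number of data points (N).
    public static double Predict(double means, double dataPoints) =>
        49.762 + 0.871 * means + 0.0001 * dataPoints;
}
```

For K = 40 and 50,000 data points this predicts 49.762 + 34.84 + 5.0 ≈ 89.6%, reasonably close to the measured 87.01%.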
45. Conclusion
• This equation shows that there is a high correlation between the enhancement and the number of means: the larger the number of means, the better the enhancement. We get this result because we used parallel loops when looping over clusters, so multi-threading is exploited more when there are more means.
46. Future Work
In this project we have worked on only two tasks in the K-Means algorithm (steps 2 & 3).
In future work, we can convert the whole algorithm into a fully concurrent version.