1. How to write a MapReduce Version of K-Means Clustering
2. Recall
iterate {
Compute the distance from each point to each of the k centers
Assign each point to its nearest center
Compute the average of the points assigned to each center
Replace the k centers with the new averages
}
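The loop above can be sketched as a plain sequential implementation. This is a minimal illustration, not the course's reference code; the `init` parameter (explicit starting centers) and the squared-distance convergence tolerance `tol` are assumptions added so the sketch is deterministic and self-contained.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise average of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def kmeans(points, k, init=None, tol=1e-6, max_iter=100):
    """Sequential k-means matching the iterate { ... } loop above."""
    centers = init if init is not None else random.sample(points, k)
    for _ in range(max_iter):
        # Compute the distance from each point to each center
        # and assign each point to its nearest center.
        clusters = {i: [] for i in range(k)}
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Replace each center with the average of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = [mean(clusters[i]) if clusters[i] else centers[i]
                       for i in range(k)]
        moved = max(dist2(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers
```

With two well-separated blobs and explicit initial centers, the loop converges in two iterations to the blob means.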
3. Recall: Parallelizing k-means
• To parallelize k-means, we want a scheme in which each point in the data
set can be operated on independently.
• In the first step of each k-means iteration, we compute the distance from
each point to each of the k cluster centers and assign the point to the
cluster with the minimum distance.
• Thus there is a small amount of shared data – namely the cluster centers.
• However, this is small in comparison to the number of data points.
• So the parallelization scheme duplicates the cluster centers across
workers; once that is done, each data point can be operated on
independently of the others and we gain a nice speedup.
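The duplication scheme above can be sketched with a thread pool: the points are split into partitions, every worker receives its own copy of the (small) centers list, and each partition is assigned independently. The helper names (`assign_partition`, `parallel_assign`) and the round-robin chunking are illustrative assumptions, not part of the course material.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_partition(partition, centers):
    """Assign every point in one partition to its nearest center,
    using this worker's private copy of the centers."""
    def nearest(pt):
        return min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(pt, centers[i])))
    return [(nearest(pt), pt) for pt in partition]

def parallel_assign(points, centers, n_workers=4):
    """Duplicate the centers to every worker and assign each
    partition of points independently."""
    chunks = [points[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda ch: assign_partition(ch, centers), chunks)
    return [pair for chunk in results for pair in chunk]
```

Only the centers are copied per worker; the (much larger) point set is partitioned, which is why the duplication cost stays small.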
4. K-means using MapReduce
• It is necessary to maintain a small amount of shared data, the cluster
centers.
• Thus when we partition points among MapReduce nodes, we also distribute
a copy of the cluster centers.
• This results in a small amount of data duplication, but it is minimal.
• In this way each of the points can be operated on independently.
• Our map phase takes in points in the data set and outputs one (ClusterID,
Point) pair for each point, where the ClusterID is the integer ID of the cluster
which is closest to the point.
• During our reduce phase, the outputs of the map phase are grouped by
ClusterID, and for each ClusterID the centroid of the points associated with
that ClusterID is calculated.
• The output of our reduce phase is a set of (ClusterID, Centroid) pairs,
which represent the newly calculated cluster centers.
• Each iteration of the algorithm is structured as a single MapReduce job,
driven by our library.
• After each job, our library reads the output, determines whether
convergence has been reached by calculating how far the cluster centers
have moved, and then runs another MapReduce job if necessary.
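The map phase, reduce phase, and driver loop described above can be sketched in a few functions. This is a single-process illustration of the dataflow, not the library the slides refer to; the function names and the squared-distance convergence tolerance `tol` are assumptions.

```python
from collections import defaultdict

def map_phase(points, centers):
    """Map: emit one (ClusterID, Point) pair per point, where ClusterID
    is the index of the nearest center (the broadcast shared data)."""
    for pt in points:
        cid = min(range(len(centers)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(pt, centers[i])))
        yield cid, pt

def reduce_phase(pairs):
    """Reduce: group pairs by ClusterID and emit (ClusterID, Centroid)
    pairs, where the centroid is the mean of the grouped points."""
    groups = defaultdict(list)
    for cid, pt in pairs:
        groups[cid].append(pt)
    return {cid: tuple(sum(c) / len(pts) for c in zip(*pts))
            for cid, pts in groups.items()}

def driver(points, centers, tol=1e-6, max_iter=100):
    """Driver: run one MapReduce job per iteration, then check how far
    the centers moved to decide whether another job is needed."""
    for _ in range(max_iter):
        new = reduce_phase(map_phase(points, centers))
        # An empty cluster keeps its previous center.
        new_centers = [new.get(i, centers[i]) for i in range(len(centers))]
        moved = max(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers
```

In a real MapReduce deployment, `map_phase` would run on each partition with its own copy of `centers`, and the framework's shuffle would perform the group-by-ClusterID step that `reduce_phase` does here with a dictionary.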
5. End of session
Day – 4: How to write a MapReduce Version of K-Means Clustering