CS 240A - Parallel Implementation of K-Means Clustering on CUDA
Lan Liu, Pritha D N
December 9, 2016
Abstract
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be
time consuming. To reduce this time, our project implements the K-Means clustering algorithm in parallel
on CUDA using C. We present our approach to parallelizing K-Means clustering, its implementation, and
a performance analysis.
1 K-Means Clustering
In this section, we give an overview of K-Means clustering: its mathematical description, the
sequential algorithm, and the complexity of the sequential code.
1.1 Description
K-Means clustering is one of the most widely used clustering methods in data mining. It aims to partition
N given data points into K clusters such that each data point belongs to the cluster with the nearest mean.
Assume there are N data points in d dimensions, $(x_1, x_2, \ldots, x_N) \subset \mathbb{R}^d$, which we need to classify into
K clusters $S = (S_1, S_2, \ldots, S_K)$, where K is generally fixed a priori. Define the center of each cluster $S_i$ as the
mean of all data points $x \in S_i$, i.e.
\[ \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x, \]
where $|S_i|$ denotes the size of $S_i$.
The goal of clustering is to minimize the total squared Euclidean distance from each data point to its cluster center,
i.e. to find the clustering S that minimizes the objective function
\[ \mathrm{cost}(S) = \sum_{i=1}^{K} \sum_{x \in S_i} \|x - \mu_i\|^2. \]
Finding the global minimum of the objective function is computationally hard (NP-hard). The
commonly used algorithm is a heuristic that finds a local minimum rather than the global minimum;
the trade-off is a much lower computational cost. The commonly used k-means algorithm is as follows:
Step 1: Initialize the cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^d$ randomly.
Step 2: Repeat until convergence: {
Assignment step: assign each data point $x_i$ to the nearest cluster by defining $c_i$, the membership
index of $x_i$, as
\[ c_i := \arg\min_{j = 1, \ldots, K} \|x_i - \mu_j\|^2. \]
Update centroids step: for each j, update
\[ \mu_j := \frac{\sum_{i=1}^{N} \mathbf{1}\{c_i = j\}\, x_i}{\sum_{i=1}^{N} \mathbf{1}\{c_i = j\}}. \]
}
Since the clustering problem is generally not convex, there may be many local minima. For any
fixed initial condition, it can easily be shown that cost(S) does not increase in any iteration step, and thus the
algorithm converges to a local minimum that depends on the initial condition. For example,
take x = (2, 4, 5, 6) with K = 2. If the initial centers are µ1 = 2, µ2 = 5, then S1 = (2), S2 = (4, 5, 6); if the initial centers are
µ1 = 4, µ2 = 5, then S1 = (2, 4), S2 = (5, 6). The first is the global minimum with cost = 2, and the second is a
suboptimal local minimum with cost = 2.5.
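The two outcomes in this example can be reproduced with a small 1-D version of the iteration. The following C function is only an illustrative sketch: the name `kmeans_1d` and its interface are assumptions made here, and it recomputes every center from scratch in each pass.

```c
#include <stddef.h>

/* Run the k-means iteration on 1-D data until no membership changes.
 * center[] holds the initial centers on entry and the final centers on exit.
 * Returns the final cost (sum of squared distances to the assigned center). */
double kmeans_1d(const double *x, size_t n, double *center, size_t k, int *member)
{
    int changed = 1;
    for (size_t i = 0; i < n; i++) member[i] = -1;   /* not yet assigned */

    while (changed) {
        changed = 0;
        /* Assignment step: nearest center by squared distance. */
        for (size_t i = 0; i < n; i++) {
            int best = 0;
            double best_d = (x[i] - center[0]) * (x[i] - center[0]);
            for (size_t j = 1; j < k; j++) {
                double d = (x[i] - center[j]) * (x[i] - center[j]);
                if (d < best_d) { best_d = d; best = (int)j; }
            }
            if (member[i] != best) { member[i] = best; changed = 1; }
        }
        /* Update step: each non-empty cluster's center becomes the mean of its members. */
        for (size_t j = 0; j < k; j++) {
            double sum = 0.0;
            size_t count = 0;
            for (size_t i = 0; i < n; i++)
                if (member[i] == (int)j) { sum += x[i]; count++; }
            if (count > 0) center[j] = sum / (double)count;
        }
    }

    double cost = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = x[i] - center[member[i]];
        cost += diff * diff;
    }
    return cost;
}
```

Called on x = (2, 4, 5, 6) with initial centers (2, 5) it returns cost 2, and with initial centers (4, 5) it returns cost 2.5 with final centers 3 and 5.5.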
For this reason, it is common to run the algorithm several times from random initial centers and keep the result
with the smallest objective function. This repeated work is a further motivation for using parallel computing to reduce running time.
1.2 Algorithm
Based on this description, we use the following sequential heuristic K-Means clustering algorithm, which
is based on Professor Wei-keng Liao's k-means clustering code [1]. We modified how that
code recomputes the new cluster centers. The original code sums all data points in each new
cluster to compute its center in every iteration step. Since only a portion of the data changes membership
(and this portion shrinks as the iteration proceeds), we instead handle each reassigned data point by adding it to its new
cluster and removing it from its old cluster. This is more efficient, and our results are consistent with this.
Step 1: Pick the first K data points as the initial cluster centers.
Step 2: Assign each data point to the nearest cluster.
Step 3: For each reassigned data point, increase the membership change count by 1.
Step 4: Set the position of each cluster center to the mean of all data points belonging to that cluster.
Step 5: Repeat steps 2-4 until convergence.
The pseudocode is as follows:
Let N be the number of data points and K the number of clusters.
data[N]: the array of data objects
center[K]: the array of cluster centers
membership[N]: the array of data point memberships
clustersum[K]: clustersum[k] is the sum of the data points in the k-th cluster (initially 0)
clustersize[K]: clustersize[k] is the size of the k-th cluster (initially 0)
δ: the number of membership changes in the current iteration
threshold: critical value that defines the stop condition; we set it to 0.001

for i from 0 to K − 1
    center[i] ← data[i]
do {
    δ ← 0
    for i from 0 to N − 1
        mindis ← ||data[i] − center[0]||
        index ← 0
        for j from 1 to K − 1
            distance ← ||data[i] − center[j]||
            if distance < mindis
                mindis ← distance
                index ← j
        if first iteration
            δ ← N
            membership[i] ← index
            clustersize[index] ← clustersize[index] + 1
            clustersum[index] ← clustersum[index] + data[i]
        else if membership[i] ≠ index
            δ ← δ + 1
            clustersize[index] ← clustersize[index] + 1
            clustersize[membership[i]] ← clustersize[membership[i]] − 1
            clustersum[index] ← clustersum[index] + data[i]
            clustersum[membership[i]] ← clustersum[membership[i]] − data[i]
            membership[i] ← index
    for j from 0 to K − 1
        if clustersize[j] > 0
            center[j] ← clustersum[j] / clustersize[j]
} while (δ/N > threshold)
Note: the loop stops once δ/N ≤ threshold, i.e., once no more than 0.1% (1‰) of all data points change membership in an iteration.
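The pseudocode above can be turned into compilable C along the following lines. This is a sketch rather than the project's code: the names `dist2` and `kmeans_seq`, the row-major data layout, and the use of -1 to mark unassigned points are assumptions made for illustration. Squared distances are compared instead of distances, since both give the same nearest center.

```c
#include <stdlib.h>

/* Squared Euclidean distance between two d-dimensional points. */
static double dist2(const double *a, const double *b, int d)
{
    double s = 0.0;
    for (int c = 0; c < d; c++) {
        double diff = a[c] - b[c];
        s += diff * diff;
    }
    return s;
}

/* Sequential k-means with incremental center updates.
 * data:   n*d values, row-major (point i starts at data[i*d])
 * center: k*d values, overwritten with the final centers
 * member: n entries, overwritten with the final memberships
 * Returns the number of iterations, or -1 if allocation fails. */
int kmeans_seq(const double *data, int n, int d, int k,
               double threshold, double *center, int *member)
{
    double *sum = calloc((size_t)k * d, sizeof *sum);
    int *size = calloc((size_t)k, sizeof *size);
    int iterations = 0, delta;

    if (!sum || !size) { free(sum); free(size); return -1; }

    /* The first K data points are the initial centers. */
    for (int j = 0; j < k; j++)
        for (int c = 0; c < d; c++)
            center[j * d + c] = data[j * d + c];
    for (int i = 0; i < n; i++) member[i] = -1;   /* not yet assigned */

    do {
        delta = 0;
        for (int i = 0; i < n; i++) {
            const double *x = data + (size_t)i * d;
            int index = 0;
            double mindis = dist2(x, center, d);
            for (int j = 1; j < k; j++) {
                double dis = dist2(x, center + (size_t)j * d, d);
                if (dis < mindis) { mindis = dis; index = j; }
            }
            if (member[i] == index) continue;

            /* Move point i: remove it from its old cluster, add it to the new one. */
            int old = member[i];
            for (int c = 0; c < d; c++) {
                if (old >= 0) sum[old * d + c] -= x[c];
                sum[index * d + c] += x[c];
            }
            if (old >= 0) size[old]--;
            size[index]++;
            member[i] = index;
            delta++;
        }
        /* Recompute centers from the running sums; an empty cluster keeps its center. */
        for (int j = 0; j < k; j++)
            if (size[j] > 0)
                for (int c = 0; c < d; c++)
                    center[j * d + c] = sum[j * d + c] / size[j];
        iterations++;
    } while ((double)delta / n > threshold);

    free(sum);
    free(size);
    return iterations;
}
```

Because every point starts unassigned, the first pass counts N membership changes, which plays the role of the "first iteration" branch in the pseudocode.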
Complexity Analysis: The sequential code has complexity O(TKND), where T is the number of iterations,
K is the number of clusters, N is the number of data points, and D is the dimension of each data point. There are
two main parts in the code. The first part reassigns each data point to the nearest center. This requires
computing the distance between each of the N data points and each of the K cluster centers, so its
complexity is O(NKD) per iteration step. The second part computes the center of each new cluster
after the reassignment. It essentially computes K group means over N data points, and its complexity
is O((N + K)D) per iteration step. Part 1 is clearly the dominant, time-consuming part, and since
it is independent for each data point, we choose to parallelize part 1. The parallel code
could target MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet for the parallelization
because GPUs process large-scale data very efficiently.
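As a rough worked example of why part 1 dominates, take the sizes used in Experiment 1 (N = 51200, D = 1000, K = 128) and count only the leading-order terms. Constant factors are ignored, so this is an illustration under that assumption rather than a measurement.

```latex
\begin{align*}
\text{part 1: } NKD &= 51200 \times 128 \times 1000 \approx 6.55 \times 10^{9},\\
\text{part 2: } (N+K)D &= 51328 \times 1000 \approx 5.13 \times 10^{7},\\
\frac{NKD}{(N+K)D} &= \frac{NK}{N+K} \approx 127.7 .
\end{align*}
```

For N much larger than K the ratio is close to K, so the share of part 1 grows with the number of clusters.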
2 Parallelization Of K-Means Using CUDA
In this section, we first introduce CUDA and the GPU nodes on Comet. We then discuss parallelization
strategies and our CUDA implementation on Comet. Finally, we use the CUDA Occupancy Calculator to
determine the optimal number of threads per block.
2.1 CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of
the graphics processing unit (GPU).
The graphics processing unit (GPU) is a specialized processor designed to meet the demands of
compute-intensive, real-time, high-resolution 3D graphics. GPUs have evolved into highly parallel
multi-core systems allowing very efficient manipulation of large blocks of data. In the computer game
industry, GPUs are used for graphics rendering, and for game physics calculations (physical effects such as
debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate
non-graphical applications in computational biology, cryptography and other fields by an order of magnitude
or more.
CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the
Tesla line.
2.1.1 NVIDIA GPUs on Comet
The Comet infrastructure provides 36 GPU nodes containing NVIDIA K80 GPUs, which belong to
NVIDIA's Tesla GPU series. Table 1 lists the configuration of a GPU node.
GPUs                2 NVIDIA K80
Cores per socket    12
Sockets             2
Clock speed         2.5 GHz
Memory capacity     128 GB DDR4 DRAM
Memory bandwidth    120 GB/s
Flash memory        320 GB
Table 1: GPU node in Comet
Figure 1: Left: how a CPU processes data. Right: how a GPU processes data. On a CPU with p cores, each core processes n/p
data points; on a GPU, each thread accesses a single data point.
Nvidia Tesla is Nvidia's brand name for its products targeting stream processing and/or general-purpose
GPU computing.
Tesla is also the name of Nvidia's first microarchitecture to implement unified shaders. It was used with GeForce 8 Series,
GeForce 9 Series, GeForce 100 Series, GeForce 200 Series, and GeForce 300 Series of GPUs manufactured
in 90 nm, 80 nm, 65 nm, and 55 nm. It also found use in the GeForce 405, and in the workstation market in
the Quadro FX, Quadro x000, Quadro NVS series, and Nvidia Tesla computing modules. The K80 used on Comet
belongs to the Tesla product line but is based on the later Kepler architecture (GK210).
With their very high computational power (measured in floating point operations per second or FLOPS)
compared to microprocessors, the Tesla products target the high performance computing market.
The physical limits for GPU compute capability 3.7 (Tesla K80) are listed in Table 2.
Threads per Warp                            32
Max Warps per Multiprocessor                64
Max Thread Blocks per Multiprocessor        16
Max Threads per Multiprocessor              2048
Maximum Thread Block Size                   1024
Registers per Multiprocessor                131072
Max Registers per Thread Block              65536
Max Registers per Thread                    255
Shared Memory per Multiprocessor (bytes)    114688
Max Shared Memory per Block (bytes)         49152
Register allocation unit size               256
Register allocation granularity             warp
Shared Memory allocation unit size          256
Warp allocation granularity                 4
Table 2: Physical limits for GPU compute capability 3.7 (Tesla K80)
Figure 2: Thread Organization in CUDA
2.2 Parallelization of K-Means clustering on CUDA
CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU can run thousands of
threads, and in our implementation each thread processes a single data point. Threads are grouped into blocks, and shared
memory is private to each block. HOST and DEVICE do not share memory, so under this configuration
we have to transfer data between HOST and DEVICE manually.
As explained at the end of Section 1.2, we parallelize the reassignment step, which computes the
distance between each data point and each cluster center. The logic and order of the parallel algorithm are exactly
the same as in the original sequential algorithm, except that the parallel algorithm must also handle the communication between
HOST and DEVICE:
Step 0: HOST initializes the cluster centers and copies the N data points to DEVICE.
Step 1: Copy the data membership and the K cluster centers from HOST to DEVICE.
Step 2: On DEVICE, each thread processes a single data point: it computes the distance to each cluster
center and updates the membership of that data point. The thread's data point index is tid = blockDim.x * blockIdx.x + threadIdx.x.
Step 3: HOST copies the new data membership from DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence, then go to step 5.
Step 5: HOST frees the allocated memory.
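The per-thread work in Step 2 can be sketched in plain C by emulating the CUDA index arithmetic with ordinary loops. This is a serial stand-in and not the project's kernel: `assign_point`, `emulate_launch`, `block_dim`, `block_idx` and `thread_idx` are hypothetical names chosen to mirror blockDim.x, blockIdx.x and threadIdx.x. On the GPU the two outer loops disappear, because each (block, thread) pair runs concurrently.

```c
/* Work done by one emulated thread: find the nearest center for point tid.
 * data is n*d row-major, center is k*d row-major. Returns 1 if the
 * membership of point tid changed, 0 otherwise. */
static int assign_point(int tid, const float *data, const float *center,
                        int n, int d, int k, int *member)
{
    if (tid >= n) return 0;   /* surplus threads in the last block do nothing */
    int best = 0;
    float best_d = 0.0f;
    for (int j = 0; j < k; j++) {
        float dist = 0.0f;
        for (int c = 0; c < d; c++) {
            float diff = data[tid * d + c] - center[j * d + c];
            dist += diff * diff;
        }
        if (j == 0 || dist < best_d) { best_d = dist; best = j; }
    }
    int changed = (member[tid] != best);
    member[tid] = best;
    return changed;
}

/* Serial emulation of a 1-D launch with enough blocks of block_dim threads
 * to cover n points. Returns the total number of membership changes. */
int emulate_launch(const float *data, const float *center,
                   int n, int d, int k, int block_dim, int *member)
{
    int grid_dim = (n + block_dim - 1) / block_dim;
    int delta = 0;
    for (int block_idx = 0; block_idx < grid_dim; block_idx++)
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int tid = block_dim * block_idx + thread_idx;   /* same formula as in Step 2 */
            delta += assign_point(tid, data, center, n, d, k, member);
        }
    return delta;
}
```

The bounds check on `tid` matters whenever N is not a multiple of the block size, since the last block then contains threads with no data point.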
There are several crucial points in the parallel code. First, a 1D array is easier for GPU threads to handle than a 2D array,
so on HOST we convert the N data points (each of dimension D) from a 2D array to a 1D array
and then send it to DEVICE, i.e. DEVICEdata[i*numCoordinates+j] = HOSTdata[i][j] (the j-th coordinate of
the i-th data point). Second, since different blocks do not share memory, we have to reduce the number of
membership changes within each block and then combine the per-block results to obtain the total number of membership changes.
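The 2D-to-1D conversion described above amounts to a row-major copy. A minimal C sketch, assuming the host data is stored as an array of row pointers (the function name `flatten_rows` is hypothetical):

```c
#include <stdlib.h>

/* Copy an n-by-d array of row pointers into one contiguous row-major buffer,
 * so that flat[i * d + j] == rows[i][j]. The caller frees the result.
 * Returns NULL if allocation fails. */
float *flatten_rows(float *const *rows, int n, int d)
{
    float *flat = malloc((size_t)n * (size_t)d * sizeof *flat);
    if (!flat) return NULL;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < d; j++)
            flat[(size_t)i * d + j] = rows[i][j];
    return flat;
}
```

A single contiguous buffer can then be transferred to the device with one copy instead of one copy per row.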
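In our implementation, we set numThreadsPerClusterBlock = 128 and, using integer division,
\[ \text{numClusterBlocks} = \left\lfloor \frac{N + \text{numThreadsPerClusterBlock} - 1}{\text{numThreadsPerClusterBlock}} \right\rfloor = \left\lceil \frac{N}{\text{numThreadsPerClusterBlock}} \right\rceil, \]
so that the blocks together provide at least N threads, one per data point.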
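The block count is an integer ceiling division, which can be written as a one-line C helper. The name `num_blocks` is hypothetical.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n
 * (integer ceiling division; assumes n >= 0 and threads_per_block > 0). */
int num_blocks(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For example, N = 51200 with 128 threads per block gives 400 blocks, while N = 51201 would need 401.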
We judge the correctness of the parallel algorithm by whether it produces the same clustering
as the original sequential k-means algorithm. Our implementation performs the same steps as the sequential
code, in parallel and without changing the logic, so identical results are expected.
2.3 Determining Optimal Number of Threads per Block
We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of
threads per block.
2.3.1 Code Analysis
To determine the resource usage of each CUDA thread in our nearest-cluster kernel,
we compiled our code with the nvcc option --ptxas-options=-v. The following is the output of the compilation:
nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v -o cuda_kmeans.o -c cuda_kmeans.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_' for 'sm_20'
ptxas info    : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 80 bytes cmem[0]
2.3.2 CUDA Occupancy Calculator
The CUDA Occupancy Calculator [3] allows you to compute the multiprocessor occupancy of a GPU by a
given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of
warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers
available for use by CUDA thread programs. These registers are a shared resource that are allocated among
the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage
to maximize the number of thread blocks that can be active in the machine simultaneously. If a program
tries to launch a kernel for which the registers used per thread times the thread block size is greater than N,
the launch will fail.
Maximizing the occupancy can help to cover latency during global memory loads that are followed by a
__syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each
thread block. Because of this, programmers need to choose the size of thread blocks with care in order to
maximize occupancy. This GPU Occupancy Calculator can assist in choosing thread block size based on
shared memory and register requirements.
For any input size, our kernel uses no shared memory. We use the CUDA Occupancy
Calculator with 128 threads per block. From the ptxas output above, we find that the kernel function
requires 18 registers per thread. We provide the following inputs to the occupancy calculator: compute capability 3.7 (GK210,
K80), 128 threads per block, shared memory size 112 KB (for 3.7), and
18 registers per thread. The results are shown in Figures 3-7.
With these inputs, 128 threads per block gave maximum occupancy for our program.
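The calculator's result can be approximated by hand from the limits in Table 2. The C sketch below is a simplified model and not NVIDIA's spreadsheet: the function and constant names are hypothetical, it assumes a valid block size of at most 1024 threads, and it ignores the per-block register limit and any other rules not listed here.

```c
/* Simplified theoretical-occupancy model using the compute capability 3.7
 * limits from Table 2. Returns the number of active warps per multiprocessor. */
enum {
    WARP_SIZE = 32, MAX_WARPS_PER_SM = 64, MAX_BLOCKS_PER_SM = 16,
    REGS_PER_SM = 131072, REG_ALLOC_UNIT = 256, WARP_ALLOC_GRANULARITY = 4,
    SMEM_PER_SM = 114688, SMEM_ALLOC_UNIT = 256
};

static int round_up(int x, int unit)   { return (x + unit - 1) / unit * unit; }
static int round_down(int x, int unit) { return x / unit * unit; }
static int min_int(int a, int b)       { return a < b ? a : b; }

int active_warps(int threads_per_block, int regs_per_thread, int smem_per_block)
{
    int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;

    /* Limit 1: warps and resident blocks per multiprocessor. */
    int blocks = min_int(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM / warps_per_block);

    /* Limit 2: registers, allocated per warp in units of REG_ALLOC_UNIT. */
    if (regs_per_thread > 0) {
        int regs_per_warp = round_up(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT);
        int warps_by_regs = round_down(REGS_PER_SM / regs_per_warp, WARP_ALLOC_GRANULARITY);
        blocks = min_int(blocks, warps_by_regs / warps_per_block);
    }

    /* Limit 3: shared memory, allocated per block in units of SMEM_ALLOC_UNIT. */
    if (smem_per_block > 0)
        blocks = min_int(blocks, SMEM_PER_SM / round_up(smem_per_block, SMEM_ALLOC_UNIT));

    return blocks * warps_per_block;
}
```

Under this model, `active_warps(128, 18, 0)` gives 64 of the 64 possible warps, while 32 or 64 threads per block give 16 and 32 warps because the limit of 16 resident blocks is reached first.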
Figure 3: Input to CUDA Occupancy Calculator
Figure 4: Output of CUDA Occupancy Calculator
Figure 5: Impact of varying Block Size
Figure 6: Impact of varying Register Count Per Thread
Figure 7: Impact of varying Shared Memory Usage Per Block
3 Parallel Performance Analysis
In this section, we present several experimental results that characterize the parallel performance.
1. Experiment 1: Vary the size N of the data set with the number of clusters fixed at K=128. The results
are in Table 3 and Figure 8. The data dimension is 1000.
For fixed K=128, the parallel speedup stays between roughly 40 and 48 as the data set size N increases.
The GPU memory capacity on Comet is 11GB, and the largest data set we tested is 8GB
(2048000*1000). Its parallel running time is 6175 sec (about 1.7 hours). The sequential
code is too slow for us to measure on this data set; we expect it to take around 3 days. The parallel code
substantially outperforms the sequential code.
Size (float values)   Sequential (sec)   Parallel (sec)   Speedup
51200*1000            463.09             11.73            39.5
76800*1000            857.73             19.25            44.55
89600*1000            1182.43            24.82            47.6
115200*1000           1676.96            35.22            47.6
128000*1000           1794.91            41.23            43.53
512000*1000           >4 hrs             405.56           NA
2048000*1000          not measured       6174.72          NA
Table 3: Experiment 1. Parallel performance when varying the data set size, with K=128 fixed
Figure 8: Experiment 1. Parallel speedup versus size of data set, K=128
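The figures quoted for the largest data set can be checked with a short calculation. It assumes 4-byte floats and, purely as an illustration, that the speedup of about 43.5 measured for 128000*1000 would also hold at 2048000*1000; that speedup was not measured at this size.

```latex
\begin{align*}
\text{data size} &= 2048000 \times 1000 \times 4\ \text{bytes} \approx 8.19 \times 10^{9}\ \text{bytes} \approx 8\ \text{GB},\\
\text{parallel time} &= 6174.72\ \text{s} \approx 1.7\ \text{hours},\\
\text{estimated sequential time} &\approx 43.53 \times 6174.72\ \text{s} \approx 2.69 \times 10^{5}\ \text{s} \approx 3.1\ \text{days}.
\end{align*}
```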
2. Experiment 2: For fixed size=51200*1000, vary the number of clusters and use δ/N < 0.001 as the stop
condition. We tested K=4, 16, 64, 128, 256, 512, 1024, 2048; the results are in Table 4 and
Figures 9 and 10.
Figure 9 shows that the parallel speedup keeps increasing as K increases, while the slope of the
curve decreases. This matches our expectation. For the sequential code, the computational cost per iteration
is O(N*D*K) for part 1 and O((N+K)*D) for part 2; in the parallel code, only part 1 is
parallelized. After T iterations, the speedup is
\[ \frac{t_1}{t_p} = \frac{O(NDKT) + O((N+K)DT)}{t_{\text{parallelized}} + O((N+K)DT)}, \]
where $t_1$ is the sequential running time, $t_p$ is the parallel running time, and $t_{\text{parallelized}}$ is the running time of the parallelized part 1.
As K grows, since N >> K and N, D are both fixed, the time spent in part 2 stays
roughly constant, while the larger K is, the more speedup part 1 gains. Thus the overall speedup
increases. Figure 10 shows that, for fixed N, the number of iterations needed to converge decreases as K increases;
this initially drives down the parallel running time even though K increases.
K (number of clusters)   Iterations   Sequential (sec)   Parallel (sec)   Speedup
4                        71           63.81              18.04            3.54
16                       51           155.64             15.06            10.33
64                       29           338.55             10.51            32.20
128                      20           463.38             11.06            41.88
256                      16           739.15             12.30            60.11
512                      12           1105.98            13.14            84.14
1024                     10           1842.00            18.98            97.04
2048                     6            2207.78            21.19            104.17
Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size
Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size
Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for fixed data size
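One way to make the Experiment 2 argument about the shape of the speedup curve concrete is a simple cost model. The constants $\alpha$, $\beta$ and the part-1 parallel speedup $s$ below are hypothetical and are not fitted to the measurements; the model also ignores host-device transfers and the change in iteration count with K.

```latex
\begin{align*}
t_1 &\approx T\,\bigl(\alpha NDK + \beta (N+K)D\bigr), \qquad
t_p \approx T\,\Bigl(\tfrac{\alpha NDK}{s} + \beta (N+K)D\Bigr),\\
\frac{t_1}{t_p} &\approx \frac{\alpha K + \beta}{\alpha K / s + \beta} \quad (N \gg K),\\
\frac{d}{dK}\,\frac{t_1}{t_p} &\approx \frac{\alpha\beta\,(1 - 1/s)}{(\alpha K / s + \beta)^{2}} > 0 \quad (s > 1).
\end{align*}
```

The derivative is positive but shrinks as K grows, and the ratio approaches $s$ for large K, which is consistent with a speedup curve that rises with a decreasing slope.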
3. Experiment 3: For fixed size=51200*1000, vary the number of clusters with the number of iterations
fixed at 30. We tested K=4, 8, 16, 32, 64, 128, 256, 512. The results are in Table 5 and Figures 11 and 12.
As in Experiment 2, we obtain a large speedup, and the speedup increases with K, as shown in
Figure 11. Experiment 3 reveals another scaling property of the code. In Figure 12 we plot
log2(running time) against log2(K). For the sequential code, the plot is very close to a straight line
with slope 1, which agrees with the fact that the sequential running time doubles when K doubles,
given the complexity O(NDKT) + O((N + K)DT). For the parallel code, the plot is a curve with
increasing slope that is always less than 1, and the fitted slope is 0.28. This implies that
the larger K is, the more parallelism is exploited and the more speedup is gained.
K (number of clusters)   Sequential (sec)   Parallel (sec)   Speedup
4                        27.82              7.72             3.60
8                        50.03              7.89             6.34
16                       94.50              8.33             11.34
32                       183.56             9.50             19.32
64                       361.92             12.31            29.40
128                      718.31             15.91            45.15
256                      1430.74            21.76            65.76
512                      2858.05            32.47            88.01
Table 5: Experiment 3. Performance when varying K, with the number of iterations fixed at 30
Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations
Figure 12: Experiment 3. Rate of growth of sequential and parallel implementations
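The scaling claim in Experiment 3 can be checked directly on the Table 5 timings by looking at how the running time changes each time K doubles. The helper below is a hypothetical sketch, not part of the project code.

```c
/* For timings t[0..n-1] measured at K, 2K, 4K, ..., write the ratio
 * t[i+1] / t[i] into ratio[0..n-2]. A ratio of 2 corresponds to slope 1
 * on a log2-log2 plot; a ratio of 1 corresponds to slope 0. */
void doubling_ratios(const double *t, int n, double *ratio)
{
    for (int i = 0; i + 1 < n; i++)
        ratio[i] = t[i + 1] / t[i];
}
```

On the Table 5 data, the sequential ratios lie between about 1.8 and 2.0, while the parallel ratios stay below 1.5, in line with a sequential slope near 1 and a parallel slope well below 1.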
3.1 Profiling
nvprof [2] presents an overview of the GPU kernels and memory copies in our program. Its summary mode
groups all calls to the same kernel together and presents the total time and percentage of the total application
time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that
let you see a complete list of all kernel launches and memory copies and, in the case of API-Trace mode, all
CUDA API calls.
We perform 4 mallocs: one for the input 2D data, one for the 2D cluster data, one for the 1D membership array, and one for the 1D
membershipChanged array. In every iteration, we perform two Device-Host copies (membership and
membershipChanged) and one Host-Device copy (the new cluster centers).
==180922== NVPROF is profiling process 180922, command: ./test_driver
==180922== Profiling application: ./test_driver
N = 51200
dimension = 1000
k = 128
threshold = 0.0010
Type: Parallel
Computation timing = 11.8963 sec
Loop iterations = 21
==180922== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 98.99%  4.11982s     21  196.18ms  195.58ms  197.96ms  find_nearest_cluster(int, int, int, float*, float*, int*, int*)
  0.98%  40.635ms     23  1.7668ms  30.624us  39.102ms  [CUDA memcpy HtoD]
  0.03%  1.2578ms     42  29.946us  28.735us  31.104us  [CUDA memcpy DtoH]
==180922== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 93.06%  4.12058s     21  196.22ms  195.62ms  198.00ms  cudaDeviceSynchronize
  5.79%  256.47ms      4  64.117ms  4.9510us  255.97ms  cudaMalloc
  1.02%  45.072ms     65  693.42us  82.267us  39.230ms  cudaMemcpy
  0.06%  2.6048ms    332  7.8450us     528ns  282.45us  cuDeviceGetAttribute
  0.03%  1.3272ms      4  331.79us  297.63us  344.04us  cuDeviceTotalMem
  0.02%  694.81us     21  33.086us  29.098us  47.856us  cudaLaunch
  0.01%  505.33us      3  168.44us  7.3170us  330.65us  cudaFree
  0.01%  237.02us      4  59.255us  56.222us  67.280us  cuDeviceGetName
  0.00%  119.83us    147     815ns     533ns  12.646us  cudaSetupArgument
  0.00%  28.884us     21  1.3750us  1.1180us  1.7890us  cudaConfigureCall
  0.00%  16.784us     21     799ns     740ns     922ns  cudaGetLastError
  0.00%  4.7380us      8     592ns     532ns     788ns  cuDeviceGet
  0.00%  3.9500us      2  1.9750us     895ns  3.0550us  cuDeviceGetCount
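The call counts in the profile can be reconciled with the copy pattern described before the listing, using the 21 loop iterations it reports. The split of the 23 Host-to-Device copies into 21 per-iteration copies plus 2 one-time copies is an inference from the numbers, not something the profile states directly.

```latex
\begin{align*}
\text{DtoH copies} &= 2 \times 21 = 42,\\
\text{HtoD copies} &= 1 \times 21 + 2 = 23,\\
\text{cudaMemcpy calls} &= 42 + 23 = 65,\\
\text{average kernel time} &= \frac{4.11982\ \text{s}}{21} \approx 196.18\ \text{ms}.
\end{align*}
```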
4 Conclusion and Future Work
Our analysis shows that we obtain a significant speedup (45X on average) over the sequential execution of
K-Means clustering. In our project, we only parallelized the computation of the nearest cluster. We
also optimized the calculation of the new cluster centers by adding each data point that changed membership to its new
cluster and subtracting it from its old cluster. This approach, as opposed to recalculating the
cluster centers from scratch, saved significant running time. Due to a shortage of time, we were not able to quantify
this additional speedup. There is definitely scope for further speedup (especially as the input dimension increases) if we
parallelize the new cluster center calculation using CUDA.