CS 240A - Parallel Implementation of K Means Clustering on
CUDA
Lan Liu, Pritha D N
December 9, 2016
Abstract
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be
time consuming, and in an attempt to minimize this time, our project is a parallel implementation of the
K-Means clustering algorithm on CUDA using C. We present the implementation and performance analysis
of our approach to parallelizing K-Means clustering.
1 K-Means Clustering
In this section, we give an overview of K-Means clustering, present its mathematical description and the
sequential algorithm, and analyze the complexity of the sequential code.
1.1 Description
K-Means clustering is one of the most widely used clustering methods in data mining. It aims to partition
N given data points into K clusters so that each cluster is as homogeneous as possible; namely, each data
point belongs to the cluster with the nearest mean.
Assume there are N data points in d dimensions, (x1, x2, ..., xN) ⊂ Rd, and we need to classify them into
K clusters S = (S1, S2, ..., SK), where K is generally fixed a priori. Define the center µi of each cluster Si
as the mean of all data points x ∈ Si, i.e.
$$\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x, \qquad \text{where } |S_i| \text{ denotes the size of } S_i.$$
The goal of clustering is to minimize the total squared Euclidean distance from each data point to its cluster
center, i.e. to find a clustering S that minimizes the objective function:
$$\mathrm{cost}(S) = \sum_{i=1}^{K} \sum_{x \in S_i} \| x - \mu_i \|^2$$
Finding the global minimum of the objective function is computationally challenging (NP-hard). The
commonly used algorithm is a heuristic that finds a local minimum rather than the global minimum; the
trade-off is a much lower computational cost. The standard k-means algorithm is as follows:
Step 1: Initialize cluster centroids µ1, µ2, ..., µK ∈ Rd randomly.
Step 2: Repeat until convergence: {
Assignment step: to assign each data point xi to the nearest cluster, define ci as the membership
index of xi:
$$c_i := \underset{1 \le j \le K}{\arg\min} \; \| x_i - \mu_j \|^2$$
Update centroids step: For each j, update
$$\mu_j := \frac{\sum_{i=1}^{N} \mathbf{1}_{\{c_i = j\}} \, x_i}{\sum_{i=1}^{N} \mathbf{1}_{\{c_i = j\}}}$$
}
Since the clustering problem is generally not convex, there may be many local minima. For any
fixed initial condition, it is easy to prove that cost(S) decreases at every iteration step, so the
algorithm converges to a local minimum that depends on the given initial condition. For example,
set x = (2, 4, 5, 6). If the initial centers are µ1 = 2, µ2 = 5, then S1 = (2), S2 = (4, 5, 6); if the initial
centers are µ1 = 4, µ2 = 5, then S1 = (2, 4), S2 = (5, 6). The first is the global minimum with cost = 2,
and the second is a local minimum with cost = 2.5.
Considering this, one might generate several clusterings from different random initial centers and choose
the one with the smallest objective function. This also motivates us to apply parallel computing to save
running time.
1.2 Algorithm
Based on this description, we apply the following heuristic K-Means clustering sequential algorithm, which
is based on Professor Wei-keng Liao's k-means clustering code [1]. We have modified how the new cluster
centers are recomputed. The original code sums over all data points in each new cluster to compute the
center at every iteration step. Considering that only a portion of the data changes membership (and fewer
and fewer points as the iteration proceeds), we instead handle only the changing data points, adding each
one to its new cluster and removing it from its old cluster. This is more efficient, and our results verify it.
Step 1: Pick the first K data points as initial cluster centers.
Step 2: Assign each data point to the nearest cluster.
Step 3: For each reassigned data point, increase the membership-change count by 1.
Step 4: Set the center of each cluster to the mean of all data points belonging to that cluster.
Step 5: Repeat steps 2-4 until convergence.
The pseudocode is as follows:
Let N be the number of data points and K the number of clusters.
data[N]: the array of data objects
center[K]: the array of cluster centers
membership[N]: the array of data point memberships
clustersum[K]: the sum of the data points in the kth cluster
clustersize[K]: the size of the kth cluster
δ: the number of membership changes
threshold: critical value defining the stop condition; we set it to 0.001
for i from 0 to K−1
    center[i] ← data[i]
do {
    δ ← 0
    for i from 0 to N−1
        mindis ← ||data[i] − center[0]||
        index ← 0
        for j from 1 to K−1
            distance ← ||data[i] − center[j]||
            if distance < mindis
                mindis ← distance
                index ← j
        if first iteration
            δ ← N
            membership[i] ← index
            clustersize[index] ← clustersize[index] + 1
            clustersum[index] ← clustersum[index] + data[i]
        else if membership[i] ≠ index
            δ ← δ + 1
            clustersize[index] ← clustersize[index] + 1
            clustersize[membership[i]] ← clustersize[membership[i]] − 1
            clustersum[index] ← clustersum[index] + data[i]
            clustersum[membership[i]] ← clustersum[membership[i]] − data[i]
            membership[i] ← index
    for j from 0 to K−1
        center[j] ← clustersum[j] / clustersize[j]
} while (δ/N > threshold)
Note: the stop condition is δ/N < threshold, i.e., the loop terminates once fewer than 1‰ of all data points change membership.
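To make the incremental update concrete, here is a minimal C sketch of the reassignment step for one data point. Variable names follow the pseudocode above; the explicit dimension loop, the float type, and the function boundary are our assumptions rather than the authors' exact code.

/* Apply one point's reassignment incrementally: add it to the new cluster's
   running sum/size and remove it from the old one. delta counts membership
   changes; on the first iteration the caller sets delta = N once, as in the
   pseudocode. */
void apply_reassignment(int i, int index, int D, int first_iteration,
                        const float *data,   /* data[i*D + d]       */
                        int *membership,     /* membership[i]       */
                        float *clustersum,   /* clustersum[k*D + d] */
                        int *clustersize,    /* clustersize[k]      */
                        int *delta)
{
    int old = membership[i];
    if (first_iteration) {
        clustersize[index]++;
        for (int d = 0; d < D; d++)
            clustersum[index * D + d] += data[i * D + d];
    } else if (old != index) {
        (*delta)++;
        clustersize[index]++;
        clustersize[old]--;
        for (int d = 0; d < D; d++) {
            clustersum[index * D + d] += data[i * D + d];
            clustersum[old * D + d]   -= data[i * D + d];
        }
    }
    membership[i] = index;
}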
Complexity Analysis: The sequential code has complexity O(TKND), where T is the number of iterations,
K is the number of clusters, N is the number of data points, and D is the dimension of each data point.
There are two main parts in the code. The first part reassigns each data point to the nearest center; this
requires computing the distance between each of the N data points and each cluster center, so the complexity
is O(NKD) per iteration step. The second part computes the center of each new cluster after the
reassignment; it essentially computes K groups of means over N data points, so the complexity is
O((N + K)D) per iteration step. Clearly, part 1 dominates the running time, and since part 1 is an
independent process for each data point, it is the natural candidate for parallelization. The parallel platform
could be MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet because of the high efficiency of GPUs
in processing large-scale data.
2 Parallelization Of K-Means Using CUDA
In this section, we first introduce CUDA and the GPU nodes on Comet, then discuss our parallelization
strategy and CUDA implementation on Comet, and finally use the CUDA Occupancy Calculator to
determine the optimal number of threads per block.
2.1 CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of
the graphics processing unit (GPU).
The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of
real-time, high-resolution 3D graphics and other compute-intensive tasks. GPUs have evolved into highly parallel
multi-core systems allowing very efficient manipulation of large blocks of data. In the computer game
industry, GPUs are used for graphics rendering, and for game physics calculations (physical effects such as
debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate
non-graphical applications in computational biology, cryptography and other fields by an order of magnitude
or more.
CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the
Tesla line.
2.1.1 NVIDIA GPUs on Comet
The Comet infrastructure provides 36 GPU nodes, each containing NVIDIA K80 GPUs. This series is
popularly called the Tesla GPU series by NVIDIA.
GPUs                2 NVIDIA K80
Cores per socket    12
Sockets             2
Clock speed         2.5 GHz
Memory capacity     128 GB DDR4 DRAM
Memory bandwidth    120 GB/s
Flash memory        320 GB
Table 1: GPU node on Comet
Figure 1: Left: how a CPU processes data; right: how a GPU processes data. On a CPU, each of p cores
processes n/p data points; on a GPU, each thread accesses a single data point.
Nvidia Tesla is Nvidia's brand name for its products targeting stream processing and general-purpose GPU
computing.
Tesla is also the name of Nvidia's first microarchitecture to implement unified shaders. It was used in the
GeForce 8, 9, 100, 200, and 300 series of GPUs, manufactured in 90 nm, 80 nm, 65 nm, and 55 nm. It also
found use in the GeForce 405 and, in the workstation market, in the Quadro FX, Quadro x000, and Quadro
NVS series and the Nvidia Tesla computing modules.
With their very high computational power (measured in floating point operations per second or FLOPS)
compared to microprocessors, the Tesla products target the high performance computing market.
The physical limits for GPU Compute Capability 3.7 (Tesla K80) are given in Table 2.
Threads per Warp 32
Max Warps per Multiprocessor 64
Max Thread Blocks per Multiprocessor 16
Max Threads per Multiprocessor 2048
Maximum Thread Block Size 1024
Registers per Multiprocessor 131072
Max Registers per Thread Block 65536
Max Registers per Thread 255
Shared Memory per Multiprocessor (bytes) 114688
Max Shared Memory per Block 49152
Register allocation unit size 256
Register allocation granularity warp
Shared Memory allocation unit size 256
Warp allocation granularity 4
Table 2: Physical limits for GPU Compute Capability 3.7 (Tesla K80)
Figure 2: Thread Organization in CUDA
2.2 Parallelization of K-Means clustering on CUDA
CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU node has access to thousands of
threads, and each thread processes a single data point. Threads are grouped into blocks, and shared
memory is restricted to each block. HOST and DEVICE do not share memory, so under this configuration
we must manually communicate messages between HOST and DEVICE.
As explained at the end of Section 1.2, we aim to parallelize the reassignment step, which computes the
distance between each data point and each cluster center. The logic and order of the parallel algorithm are
exactly the same as in the original sequential algorithm, but we must account for the communication between
HOST and DEVICE:
Step 0: HOST initializes the cluster centers and copies the N data coordinates to DEVICE.
Step 1: DEVICE copies the data memberships and K cluster centers from HOST.
Step 2: On DEVICE, each thread processes a single data point, computes the distance to each cluster
center, and updates the point's membership, with tid = blockDim.x * blockIdx.x + threadIdx.x (a kernel
sketch follows below).
Step 3: HOST copies the new data memberships from DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence, then go to step 5.
Step 5: HOST frees the allocated memory.
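Below is a sketch of the assignment-step kernel (Step 2). The kernel name and argument types match the ptxas output shown in Section 2.3.1, which demangles to find_nearest_cluster(int, int, int, float*, float*, int*, int*), but the parameter meanings and the body are our reconstruction, not the authors' exact code.

/* Each thread finds the nearest cluster for one data point (sketch). */
__global__ void find_nearest_cluster(int numCoords, int numObjs, int numClusters,
                                     float *objects,          /* [numObjs * numCoords]     */
                                     float *clusters,         /* [numClusters * numCoords] */
                                     int   *membership,       /* [numObjs]                 */
                                     int   *membershipChanged /* per-point change flag     */)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= numObjs) return;

    int   best    = 0;
    float mindist = 3.402823466e+38f;   /* FLT_MAX */
    for (int k = 0; k < numClusters; k++) {
        float dist = 0.0f;
        for (int d = 0; d < numCoords; d++) {
            float diff = objects[tid * numCoords + d] - clusters[k * numCoords + d];
            dist += diff * diff;        /* squared Euclidean distance */
        }
        if (dist < mindist) { mindist = dist; best = k; }
    }
    membershipChanged[tid] = (membership[tid] != best);
    membership[tid] = best;
}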
There are several crucial points in the parallel code. First, it is easier for GPU threads to handle a 1D
array than a 2D array, so we flatten the N data points (in D dimensions) from a 2D array into a 1D array
on HOST and then send it to DEVICE, i.e. DEVICEdata[i*numCoordinates+j] = HOSTdata[i][j] (the jth
coordinate of the ith data point). Second, since different blocks do not share memory, we must reduce the
number of membership changes within each block before computing the total number of membership
changes (see the sketch below).
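The following is an illustrative sketch of that per-block reduction using dynamic shared memory, with one partial sum written per block; the authors' exact reduction code is not shown in the report, so the details here are our assumptions.

/* Tree-reduce per-thread change flags within a block (sketch); each block
   writes one partial count, which HOST (or a second kernel) sums into δ. */
__device__ void reduce_block_changes(int myChanged, int *blockDeltas)
{
    extern __shared__ unsigned char changed[];  /* blockDim.x bytes, dynamic */
    changed[threadIdx.x] = (unsigned char)myChanged;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            changed[threadIdx.x] += changed[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockDeltas[blockIdx.x] = changed[0];   /* one partial sum per block */
}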
In our implementation, we set numThreadsPerClusterBlock = 128 and

numClusterBlocks = (N + numThreadsPerClusterBlock − 1) / numThreadsPerClusterBlock

(integer division, i.e. ⌈N / numThreadsPerClusterBlock⌉).
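The corresponding kernel launch, as a sketch; the third launch parameter (dynamic shared memory for the per-block reduction above) is our assumption:

const unsigned int numThreadsPerClusterBlock = 128;
const unsigned int numClusterBlocks =
    (N + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock;

find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock,
                       numThreadsPerClusterBlock * sizeof(unsigned char)>>>(
    numCoords, N, K, deviceData, deviceClusters,
    deviceMembership, deviceMembershipChanged);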
The correctness of the parallel algorithm is verified by checking that it produces the same clustering as the
original sequential k-means algorithm. Since our implementation performs the same steps as the sequential
code, in parallel and without changing the logic, this is the expected behavior.
2.3 Determining Optimal Number of Threads per Block
We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of
threads per block.
2.3.1 Code Analysis
To determine the resource usage of each CUDA thread in our nearest-cluster kernel, we compiled our code
with the --ptxas-options=-v nvcc flag. The following is the output of the compilation:
nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v \
    -o cuda_kmeans.o -c cuda_kmeans.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_' for 'sm_20'
ptxas info : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 80 bytes cmem[0]
2.3.2 CUDA Occupancy Calculator
The CUDA Occupancy Calculator [3] allows you to compute the multiprocessor occupancy of a GPU by a
given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of
warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers
available for use by CUDA thread programs. These registers are a shared resource that are allocated among
the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage
to maximize the number of thread blocks that can be active in the machine simultaneously. If a program
tries to launch a kernel for which the registers used per thread times the thread block size is greater than N,
the launch will fail.
Maximizing the occupancy can help to cover latency during global memory loads that are followed by a
__syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each
thread block. Because of this, programmers need to choose the size of thread blocks with care in order to
maximize occupancy. This GPU Occupancy Calculator can assist in choosing thread block size based on
shared memory and register requirements.
For any input size, the shared memory used by our program is zero. We use the CUDA Occupancy
Calculator with 128 threads per block. From the disassembly of the code, we find that our kernel function
requires 18 registers. We provide the compute capability (3.7, GK210/K80), the number of threads per
block (128), the shared memory size (112 KB for 3.7), and the number of registers per thread (18) as input
to the occupancy calculator. The results are shown in the figures below. A block size of 128 threads gave
the maximum occupancy for our program.
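As a cross-check of the spreadsheet, similar numbers can be queried programmatically with the CUDA runtime occupancy API (available since CUDA 6.5). This sketch is our addition and was not part of the original workflow; the dynamic shared-memory size passed here is the assumption made in the launch sketch above.

#include <stdio.h>

/* Query occupancy for the assignment kernel at 128 threads per block. */
void report_occupancy(void)
{
    int maxBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocks, find_nearest_cluster,  /* kernel under study          */
        128,                               /* threads per block           */
        128 * sizeof(unsigned char));      /* dynamic shared mem per block */
    printf("Active blocks per SM: %d, active warps: %d of 64\n",
           maxBlocks, maxBlocks * (128 / 32));
}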
Figure 3: Input to CUDA Occupancy Calculator
Figure 4: Output of CUDA Occupancy Calculator
Figure 5: Impact of varying Block Size
Figure 6: Impact of varying Register Count Per Thread
Figure 7: Impact of varying Shared Memory Usage Per Block
3 Parallel Performance Analysis
In this section, we present several experimental results to evaluate the parallel performance.
1. Experiment 1: Vary the size of the data set N with the number of clusters fixed at K = 128. The
performance results are in Table 3 and Figure 8. The data dimension is 1000.
For fixed K = 128, as the data set size N increases, the parallel speedup is around 40 and increases
gradually. The GPU memory capacity on Comet is 11 GB, and we tested an 8 GB data set
(2048000*1000): the parallel running time is 6175 sec (about 1.7 hours), while the sequential code is
too slow to time; we expect it to take around 3 days. The parallel code vastly outperforms the
sequential code.
Size (float values)   Sequential (sec)   Parallel (sec)   Speedup
51200*1000            463.09             11.73            39.5
76800*1000            857.73             19.25            44.55
89600*1000            1182.43            24.82            47.6
115200*1000           1676.96            35.22            47.6
128000*1000           1794.91            41.23            43.53
512000*1000           >4 hrs             405.56           NA
2048000*1000          -                  6174.72          -
Table 3: Experiment 1. Parallel performance when varying data size, K = 128 fixed
Figure 8: Experiment 1. Parallel speedup versus data set size, K = 128
2. Experiment 2: For fixed size = 51200*1000, vary the number of clusters, using δ/N < 0.001 as the
stop condition. We tested K = 4, 16, 64, 128, 256, 512, 1024, 2048; the results are in Table 4 and
Figures 9 and 10.
Figure 9 shows that the parallel speedup keeps increasing as K increases, while the slope of the
curve decreases. This matches our expectation. In each iteration of the sequential code, the
computational cost of part 1 is O(NDK) and that of part 2 is O((N + K)D); in the parallel code, only
part 1 has been parallelized. After running T iterations, the speedup is:
$$\frac{t_1}{t_p} = \frac{O(NDKT) + O((N + K)DT)}{\big[\text{part 1, parallelized}\big] + O((N + K)DT)}$$
As K grows, since N >> K and both N and D are fixed, the time consumed by part 2 stays steady,
while the larger K is, the more speedup is gained from part 1; thus the overall speedup increases.
Figure 10 shows that as K increases for fixed N, the number of iterations needed to converge
decreases, which drives the parallel running time down in the early stage even as K increases.
K (num of clusters)   num of iterations   Sequential (sec)   Parallel (sec)   Speedup
4                     71                  63.81              18.04            3.54
16                    51                  155.64             15.06            10.33
64                    29                  338.55             10.51            32.20
128                   20                  463.38             11.06            41.88
256                   16                  739.15             12.30            60.11
512                   12                  1105.98            13.14            84.14
1024                  10                  1842.00            18.98            97.04
2048                  6                   2207.78            21.19            104.17
Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size
Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size
Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for
fixed data size
3. Experiment 3: For fixed size = 51200*1000, vary the number of clusters with the number of
iterations fixed at 30. We tested K = 4, 8, 16, 32, 64, 128, 256, 512. The results are in Table 5 and
Figures 11 and 12.
As in Experiment 2, we get an outstanding speedup, and the speedup rises as K rises, as shown in
Figure 11. Experiment 3 reveals another scaling property of the code. In Figure 12 we plot log2(K)
against log2(running time). For the sequential case, the plot is very close to a straight line with slope
1, which coincides with the fact that as K doubles the sequential running time doubles, given the
complexity O(NDKT) + O((N + K)DT). For the parallel case, it is a curve whose slope increases but
always stays below 1, with a fitted slope of 0.28. This implies that the available parallelism grows as
K grows, gaining more speedup.
K (num of clusters)   Sequential (sec)   Parallel (sec)   Speedup
4                     27.82              7.72             3.60
8                     50.03              7.89             6.34
16                    94.50              8.33             11.34
32                    183.56             9.50             19.32
64                    361.92             12.31            29.40
128                   718.31             15.91            45.15
256                   1430.74            21.76            65.76
512                   2858.05            32.47            88.01
Table 5: Experiment 3. Performance when varying K, with the number of iterations fixed at 30
Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations
Figure 12: Experiment 3. Rate of growth of the sequential and parallel implementations
3.1 Profiling
nvprof [2] presents an overview of the GPU kernels and memory copies in our program. The summary
groups all calls to the same kernel together, presenting the total time and the percentage of total application
time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that
list every kernel launch and memory copy and, in the case of API-Trace mode, all CUDA API calls.
We perform 4 mallocs: one for the input 2D data, one for the 2D cluster data, one for the 1D membership
array, and one for the 1D membershipChanged array. In every iteration, we perform two Device-to-Host
copies (membership and membershipChanged) and one Host-to-Device copy (the new cluster centers), as
sketched below.
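A sketch of the HOST-side iteration loop implied by the profile below: one kernel launch plus the two Device-to-Host copies and one Host-to-Device copy per iteration. Variable names, buffer sizes, and the recompute_centers_incrementally helper are our assumptions for illustration.

do {
    find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock,
                           numThreadsPerClusterBlock * sizeof(unsigned char)>>>(
        numCoords, N, K, deviceData, deviceClusters,
        deviceMembership, deviceMembershipChanged);
    cudaDeviceSynchronize();            /* dominates API time in the profile */

    cudaMemcpy(membership, deviceMembership,
               N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(membershipChanged, deviceMembershipChanged,
               numClusterBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    delta = 0;
    for (int b = 0; b < numClusterBlocks; b++)  /* total membership changes */
        delta += membershipChanged[b];
    recompute_centers_incrementally();          /* hypothetical HOST helper */

    cudaMemcpy(deviceClusters, clusters,
               K * numCoords * sizeof(float), cudaMemcpyHostToDevice);
} while ((float)delta / N > threshold);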
==180922== NVPROF is profiling process 180922, command: ./test_driver
==180922== Profiling application: ./test_driver
N = 51200
dimension = 1000
k = 128
threshold = 0.0010
Type: Parallel
Computation timing = 11.8963 sec
Loop iterations = 21
==180922== Profiling result:
Time(%)   Time       Calls  Avg       Min       Max       Name
98.99%    4.11982s   21     196.18ms  195.58ms  197.96ms  find_nearest_cluster(int, int, int, float*, float*, int*, int*)
 0.98%    40.635ms   23     1.7668ms  30.624us  39.102ms  [CUDA memcpy HtoD]
 0.03%    1.2578ms   42     29.946us  28.735us  31.104us  [CUDA memcpy DtoH]
==180922== API calls:
Time(%)   Time       Calls  Avg       Min       Max       Name
93.06%    4.12058s   21     196.22ms  195.62ms  198.00ms  cudaDeviceSynchronize
 5.79%    256.47ms   4      64.117ms  4.9510us  255.97ms  cudaMalloc
 1.02%    45.072ms   65     693.42us  82.267us  39.230ms  cudaMemcpy
 0.06%    2.6048ms   332    7.8450us  528ns     282.45us  cuDeviceGetAttribute
 0.03%    1.3272ms   4      331.79us  297.63us  344.04us  cuDeviceTotalMem
 0.02%    694.81us   21     33.086us  29.098us  47.856us  cudaLaunch
 0.01%    505.33us   3      168.44us  7.3170us  330.65us  cudaFree
 0.01%    237.02us   4      59.255us  56.222us  67.280us  cuDeviceGetName
 0.00%    119.83us   147    815ns     533ns     12.646us  cudaSetupArgument
 0.00%    28.884us   21     1.3750us  1.1180us  1.7890us  cudaConfigureCall
 0.00%    16.784us   21     799ns     740ns     922ns     cudaGetLastError
 0.00%    4.7380us   8      592ns     532ns     788ns     cuDeviceGet
 0.00%    3.9500us   2      1.9750us  895ns     3.0550us  cuDeviceGetCount
4 Conclusion and Future Work
Our analysis shows that we obtain a significant speedup (45x on average) over the sequential execution of
K-Means clustering. In our project, we parallelized only the computation of the nearest cluster. We
optimized the calculation of the new cluster centers by adding points that changed membership to their new
cluster groups and subtracting them from their old ones. This approach, as opposed to recalculating the
cluster centers from scratch, saved significant running time. Due to a shortage of time, we were not able to
quantify the additional speedup from this optimization. There is definitely scope for further speedup (as the
input dimension increases) if we also parallelize the new cluster center calculation on CUDA.
References
[1] K-Means Algorithm: http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html, Wei-keng Liao,
Northwestern University, 2005.
[2] NVPROF: http://docs.nvidia.com/cuda/profiler-users-guide/#axzz4SHzfjCkf
[3] CUDA Occupancy Calculator: https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/
[4] Understanding CUDA: https://courses.engr.illinois.edu/ece498al/Syllabus.html
11

More Related Content

What's hot

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」Preferred Networks
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdwKohei KaiGai
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Kohei KaiGai
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 

What's hot (20)

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
Cuda
CudaCuda
Cuda
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 

Viewers also liked

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Igor Korkin
 
Implementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedImplementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedTommaso Campari
 
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDASchulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDAJörn Dinkla
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsMarcos Gonzalez
 
PL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsPL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsKohei KaiGai
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Sara Truscott
 
Grm 201 project
Grm 201 projectGrm 201 project
Grm 201 projectnmjameson
 
Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Raffi Krikorian
 
China's Younger Architects 2014
China's Younger Architects 2014China's Younger Architects 2014
China's Younger Architects 2014Joe Carter
 
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieEngaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieNitin Karkara
 
Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. ProColombia
 

Viewers also liked (19)

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
 
Implementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedImplementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-based
 
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDASchulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
 
CUDA-Aware MPI
CUDA-Aware MPICUDA-Aware MPI
CUDA-Aware MPI
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
 
PL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsPL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database Analytics
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)
 
Grm 201 project
Grm 201 projectGrm 201 project
Grm 201 project
 
Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)
 
China's Younger Architects 2014
China's Younger Architects 2014China's Younger Architects 2014
China's Younger Architects 2014
 
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieEngaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
 
Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas.
 
Ank 48
Ank 48Ank 48
Ank 48
 

Similar to Parallel Implementation of K Means Clustering on CUDA

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET Journal
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 

Similar to Parallel Implementation of K Means Clustering on CUDA (20)

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
An35225228
An35225228An35225228
An35225228
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
 
Project PPT
Project PPTProject PPT
Project PPT
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
icwet1097
icwet1097icwet1097
icwet1097
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
 
Data miningpresentation
Data miningpresentationData miningpresentation
Data miningpresentation
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 

Recently uploaded

Theory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfTheory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfShreyas Pandit
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptxmohitesoham12
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organizationchnrketan
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTSneha Padhiar
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProRay Yuan Liu
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communicationpanditadesh123
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 

Recently uploaded (20)

Theory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdfTheory of Machine Notes / Lecture Material .pdf
Theory of Machine Notes / Lecture Material .pdf
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
Python Programming for basic beginners.pptx
Python Programming for basic beginners.pptxPython Programming for basic beginners.pptx
Python Programming for basic beginners.pptx
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptxTriangulation survey (Basic Mine Surveying)_MI10412MI.pptx
Triangulation survey (Basic Mine Surveying)_MI10412MI.pptx
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 
priority interrupt computer organization
priority interrupt computer organizationpriority interrupt computer organization
priority interrupt computer organization
 
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENTFUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
FUNCTIONAL AND NON FUNCTIONAL REQUIREMENT
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Designing pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptxDesigning pile caps according to ACI 318-19.pptx
Designing pile caps according to ACI 318-19.pptx
 
A brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision ProA brief look at visionOS - How to develop app on Apple's Vision Pro
A brief look at visionOS - How to develop app on Apple's Vision Pro
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
multiple access in wireless communication
multiple access in wireless communicationmultiple access in wireless communication
multiple access in wireless communication
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 

Parallel Implementation of K Means Clustering on CUDA

  • 1. CS 240A - Parallel Implementation of K Means Clustering on CUDA Lan Liu, Pritha D N December 9, 2016 Abstract K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K- Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering. 1 K-Means Clustering In this section, we provide an overview of K-Means clustering, the mathematical description and the sequential algorithm has been presented, and the complexity of sequential code has been analyzed. 1.1 Description K-Means clustering is one of the most widely used clustering method used in data mining, it aims to partition N given data points into K clusters, in which each cluster has the most similarity, namely, each data point belongs to the cluster with the nearest mean. Assume there are N data points in d dimension, (x1, x2, ...., xn) ⊂ Rd, and we need to classify them into K clusters S = (S1, S2, ..., SK ), where K is generally fixed apriori. Define the center of each cluster Si as the mean of all data points x ∈ Si, i.e. µi x∈Si x |Si | , where |Si | denotes size of Si The goal of clustering is to minimize the total Euclidean distance from each data point to its cluster center, i.e. find clustering S to minimize the objective function: cost(S) = k i=1 x∈Si ||x − µi ||2 Finding the global minimum of the objective function is computationally challenging (NP-hard). The commonly used algorithm is really a heuristic which can find a local minimum instead of global minimum, and the trade off is that computational cost is cheaper. The commonly used k-means algorithm is as follows: Step1: Initialize cluster centroids µ1, µ2, ..., µk ∈ Rn randomly. Step2: Repeat until convergence:{ Assignment step: In order to assign each data point xi to the nearest cluster Si, define ci as membership index of xi, ci := argmink j=1||xi − µj ||2 Update centroids step: For each j, update µj := N i=1 1ci =j xi N i=1 1ci =j } Since generally the clustering problem is not convex, there might be a lot of local minimum. For any fixed initial condition, It can be easily proved that cost(S) decreases for every iteration step, and thus the 1
  • 2. algorithm converges to a unique local minimum depending on which initial condition is given. For example, set x = (2, 4, 5, 6), if initial centers are µ1 = 2, µ2 = 5, then S1 = (2), S2 = (4, 5, 6), if initial centers are µ1 = 4, µ2 = 5, then S1 = (2, 4), S2 = (5, 6), the first is global minimal with cost=2, and the second is a saddle with cost=2.5. Consider this, people might generate a couple of result using random initial centers, and choose the best with smallest objective function. And this also drives us to apply parallel computing to save running time. 1.2 Algorithm Based on the description, we apply the following heuristic K Means clustering sequential algorithm, this algorithm is based on the Professor Wei-keng Liao’s k-means clustering code [1]. We have modified the code about how to recompute the new cluster center. In that code it sumerizes all data points in the new cluster to compute center in every iteration step, considering that only a portion of data changes membership (and being less and less as iteration goes), we instead treat the changing data by adding the data into new cluster and removing it from old cluster, this will be more efficient and the result also verifies this. Step 1: Pick the first K data points as initial cluster centers. Step2: Attribute each data point to the nearest cluster. Step3: For each reassigned data point, membership change increase by 1. Step4: Set the position of each new cluster to be the mean of all data points belonging to that cluster. Step 5: Repeat steps 2-4 until convergence. The pseudo code is as follows: Let N be the number of data points, K be the number of clusters. data[N]: the array of data objects center[K]: the array of cluster centers membership[N]: the array of data point membership. clustersum[K]: the sum of data points in Kth cluster. clustersize[K]: the size of Kth cluster. δ: count the number of membership change. threshold: critical value to define stop condition, we set it be 0.001. for i from 0 to K-1 center[i] ←− data[i] do{ δ ←− 0 for i from 0 to N − 1 mindis=||data[i]-center[0]|| for j from 1 to K − 1 distance=||data[i]-center[j]|| if distance<mindis mindis←−distance index←−j if first iteration δ ←−N membership[i]←−index clustersize[index]←−clustersize[index]+1 clustersum[index]←−clustersum[index]+data[i] else if membership[i] index δ = δ + 1 clustersize[index]←−clustersize[index]+1 clustersize[membership[i]]←−clustersize[membership[i]]-1 clustersum[index]←−clustersum[index]+data[i] clustersum[membership[i]]←−clustersum[membership[i]]-data[i] membership[i]←−index for j from 0 to K − 1 center[j]←−clustersum[j]/clustersize[j] } While(δ/N>threshold) 2
Note: the stop condition is δ/N < threshold, i.e., the algorithm stops once fewer than 1‰ of all data points change membership.

Complexity Analysis: The sequential code has complexity O(TKND), where T is the number of iterations, K the number of clusters, N the number of data points, and D the dimension of each data point. The code has two main parts. The first reassigns each data point to the nearest center, which requires computing the distance between every data point and every cluster center; its complexity is O(NKD) per iteration. The second recomputes the center of each cluster after the reassignment, which amounts to computing K means over N data points, with complexity O((N + K)D) per iteration. Part 1 is clearly the dominant cost, and since it is an independent computation for each data point, it is the natural target for parallelization. The parallel code could be written in MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet because of the efficiency of GPUs in processing large-scale data.

2 Parallelization of K-Means Using CUDA

In this section, we first introduce CUDA and the GPU nodes on Comet, then discuss the parallelization strategy and its CUDA implementation on Comet, and finally use the CUDA Occupancy Calculator to determine the optimal number of threads per block.

2.1 CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). The GPU, as a specialized processor, addresses the demands of real-time, high-resolution 3D graphics and other compute-intensive tasks. GPUs have evolved into highly parallel multi-core systems that allow very efficient manipulation of large blocks of data. In the computer game industry, GPUs are used for graphics rendering and for game physics calculations (physical effects such as debris, smoke, fire, and fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography, and other fields by an order of magnitude or more. CUDA works with all NVIDIA GPUs from the G8x series onwards, including the GeForce, Quadro, and Tesla lines.

2.1.1 NVIDIA GPUs on Comet

The Comet infrastructure provides 36 GPU nodes, each containing NVIDIA K80 GPUs. This series is popularly called the Tesla GPU series by NVIDIA.

GPUs                2 NVIDIA K-80
Cores per socket    12
Sockets             2
Clock speed         2.5 GHz
Memory capacity     128 GB DDR4 DRAM
Memory bandwidth    120 GB/s
Flash memory        320 GB

Table 1: GPU node on Comet

Figure 1: Left is how a CPU processes data, right is how a GPU processes data. On a CPU, each of p cores processes n/p data points; on a GPU, each thread gets access to one single data point.
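The hardware limits above can also be queried at runtime. The following is a small CUDA C sketch (ours, not part of the project code) that prints the relevant properties of the device it runs on:

    /* Query the device properties discussed above at runtime (our sketch). */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int dev = 0;
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            fprintf(stderr, "no CUDA device found\n");
            return 1;
        }
        printf("name: %s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
        printf("multiprocessors: %d\n", prop.multiProcessorCount);
        printf("warp size: %d\n", prop.warpSize);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("registers per block: %d\n", prop.regsPerBlock);
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("global memory: %zu bytes\n", prop.totalGlobalMem);
        return 0;
    }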
Nvidia Tesla is Nvidia's brand name for its products targeting stream processing and general-purpose GPU computing. Tesla is Nvidia's first microarchitecture to implement unified shaders. It was used in the GeForce 8, 9, 100, 200, and 300 series of GPUs, manufactured in 90 nm, 80 nm, 65 nm, and 55 nm processes. It also found use in the GeForce 405 and, in the workstation market, in the Quadro FX, Quadro x000, and Quadro NVS series and the Nvidia Tesla computing modules. With their very high computational power (measured in floating point operations per second, or FLOPS) compared to microprocessors, the Tesla products target the high-performance computing market. The physical limits for GPU Compute Capability 3.7 (Tesla K80) are given in Table 2.

Threads per Warp                              32
Max Warps per Multiprocessor                  64
Max Thread Blocks per Multiprocessor          16
Max Threads per Multiprocessor                2048
Maximum Thread Block Size                     1024
Registers per Multiprocessor                  131072
Max Registers per Thread Block                65536
Max Registers per Thread                      255
Shared Memory per Multiprocessor (bytes)      114688
Max Shared Memory per Block (bytes)           49152
Register allocation unit size                 256
Register allocation granularity               warp
Shared Memory allocation unit size            256
Warp allocation granularity                   4

Table 2: Physical limits for GPU Compute Capability 3.7 (Tesla K80)

Figure 2: Thread organization in CUDA

2.2 Parallelization of K-Means Clustering on CUDA

CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU node has access to thousands of threads, and each thread processes one single data point. Threads are grouped into blocks, and shared memory is restricted to each block. The HOST and DEVICE do not share memory, so under this configuration we must manually communicate messages between HOST and DEVICE. As explained at the end of Section 1.2, we aim to parallelize the reassignment step, which computes the distance between each data point and each cluster center. The logic and order of the parallel algorithm are exactly the same as in the original sequential algorithm, and we must account for the communication between HOST and DEVICE:

Step 0: The HOST initializes the cluster centers and copies the N data coordinates to the DEVICE.
Step 1: The DEVICE copies the data memberships and the K cluster centers from the HOST.
Step 2: On the DEVICE, each thread processes a single data point: it computes the distance to each cluster center and updates the data point's membership. The thread index is tid = blockDim.x * blockIdx.x + threadIdx.x.
Step 3: The HOST copies the new data memberships from the DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence; go to step 5 once converged.
Step 5: The HOST frees the allocated memory.

There are several crucial points in the parallel code. First, it is easier for GPU threads to handle a 1D array than a 2D array, so we convert the N data points (in D dimensions) from a 2D array to a 1D array on the HOST
and then send it to the DEVICE, i.e., DEVICEdata[i*numCoordinates+j] = HOSTdata[i][j] (the jth coordinate of the ith data point). Secondly, since different blocks do not share memory, we must reduce the per-block counts of membership changes to compute the total number of membership changes. In our implementation, we set:

numThreadsPerClusterBlock = 128
numClusterBlocks = (N + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock

The correctness of the parallel algorithm is guaranteed in the sense that it produces the same clustering as the original sequential k-means algorithm: our implementation performs the same steps as the sequential code, in parallel and without changing the logic, so the same result is expected.

2.3 Determining the Optimal Number of Threads per Block

We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of threads per block.

2.3.1 Code Analysis

To determine the resource usage of each CUDA thread in our nearest-cluster kernel, we compiled the code with the ptxas nvcc option. The following is the output of the compilation:

nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v -o cuda_kmeans.o -c cuda_kmeans.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_' for 'sm_20'
ptxas info : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 80 bytes cmem[0]

2.3.2 CUDA Occupancy Calculator

The CUDA Occupancy Calculator [3] allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size exceeds N, the launch will fail.

Maximizing occupancy can help cover latency during global memory loads that are followed by a __syncthreads(). The occupancy is determined by the amount of shared memory and the number of registers used by each thread block. Because of this, programmers need to choose the size of thread blocks with care in order to maximize occupancy. The GPU Occupancy Calculator assists in choosing a thread block size based on shared memory and register requirements.

For any input size, the shared memory used by our program is zero. We use the CUDA Occupancy Calculator with 128 threads per block. From the disassembly of the code, we find that our kernel function requires 18 registers. We provide the compute capability, 3.7 (GK210, K80); the number of threads per block, 128; the shared memory size, 112 KB (for 3.7); and the number of registers per thread, 18, as input to the occupancy calculator. The results are indicated in the figures included. 128 threads per block gave the maximum occupancy for our program.
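To illustrate Step 2 and the per-block reduction of membership changes, here is a hedged CUDA sketch of a nearest-cluster kernel. The project's actual kernel is find_nearest_cluster (see the ptxas output above); the body below is our reconstruction from the description, not the project's exact code:

    /* Our CUDA sketch of the nearest-cluster kernel described in Step 2.
     * data:    N*D flattened coordinates, data[i*D + j]
     * centers: K*D flattened cluster centers
     * blockDelta[b]: number of membership changes counted in block b,
     *                to be summed on the HOST afterwards. */
    __global__ void find_nearest_cluster_sketch(int N, int D, int K,
                                                const float *data,
                                                const float *centers,
                                                int *membership,
                                                int *blockDelta)
    {
        extern __shared__ int changed[];   /* one change flag per thread */
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        changed[threadIdx.x] = 0;

        if (tid < N) {
            int index = 0;
            float mindis = 3.402823e38f;   /* FLT_MAX */
            for (int k = 0; k < K; k++) {
                float s = 0.0f;            /* squared distance to center k */
                for (int j = 0; j < D; j++) {
                    float d = data[tid * D + j] - centers[k * D + j];
                    s += d * d;
                }
                if (s < mindis) { mindis = s; index = k; }
            }
            if (membership[tid] != index) {
                changed[threadIdx.x] = 1;  /* this point switched clusters */
                membership[tid] = index;
            }
        }
        __syncthreads();

        /* tree-reduce the change flags within the block
         * (assumes blockDim.x is a power of two, e.g. 128) */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                changed[threadIdx.x] += changed[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockDelta[blockIdx.x] = changed[0];
    }

With the configuration above, a matching launch would be find_nearest_cluster_sketch<<<numClusterBlocks, numThreadsPerClusterBlock, numThreadsPerClusterBlock * sizeof(int)>>>(N, D, K, d_data, d_centers, d_membership, d_blockDelta), after which the HOST sums blockDelta to obtain δ.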
Figure 3: Input to CUDA Occupancy Calculator
Figure 4: Output of CUDA Occupancy Calculator
Figure 5: Impact of varying Block Size
Figure 6: Impact of varying Register Count Per Thread
Figure 7: Impact of varying Shared Memory Usage Per Block
3 Parallel Performance Analysis

In this section, we present several experimental results to show the parallel performance.

1. Experiment 1: Vary the size of the data set N, with the number of clusters fixed at K = 128 and data dimension 1000. The results are in Table 3 and Figure 8. For fixed K = 128, as the size of the data set N increases, the parallel speedup is around 40 and increases gradually. The GPU memory capacity on Comet is 11 GB, and we tested an 8 GB data set (2048000*1000): the parallel running time was 6175 sec (about 1.7 hours), while the sequential code was too slow to time, and we expect it would take around 3 days. The parallel code greatly outperforms the sequential code.

Size (float values)   Sequential (sec)   Parallel (sec)   Speedup
51200*1000            463.09             11.73            39.5
76800*1000            857.73             19.25            44.55
89600*1000            1182.43            24.82            47.6
115200*1000           1676.96            35.22            47.6
128000*1000           1794.91            41.23            43.53
512000*1000           >4 hrs             405.56           NA
2048000*1000          -                  6174.72          -

Table 3: Experiment 1. Parallel performance when varying size, K=128 fixed

Figure 8: Experiment 1. Parallel speedup versus size of data set, K=128

2. Experiment 2: For fixed size 51200*1000, vary the number of clusters, using δ/N < 0.001 as the stop condition. We tested K = 4, 16, 64, 128, 256, 512, 1024, and 2048; the results are in Table 4 and Figures 9 and 10. Figure 9 shows that the parallel speedup keeps increasing as K increases, while the derivative of the curve decreases. This matches our expectation. In each iteration, the computational cost of part 1 of the sequential code is O(NDK) and that of part 2 is O((N+K)D); in the parallel code, only part 1 has been parallelized. After running T iterations, the speedup is

t1/tp = (O(NDKT) + O((N+K)DT)) / (O(NDKT) parallelized + O((N+K)DT)),

where the first term in the denominator is the cost of part 1 after parallelization. As K grows, since N >> K and D > K, with N and D both fixed, the cost of part 2 stays steady, and the larger K is, the more speedup is earned by part 1; thus the overall speedup increases. Figure 10 shows that as K increases with N fixed, the number of iterations needed to converge decreases, which drives the parallel running time down at first even as K increases.
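For reference, the same model in LaTeX (p here is our notation for the effective parallelism of part 1, not a quantity measured in the experiments):

    \[
      \frac{t_1}{t_p}
      = \frac{O(NDKT) + O\bigl((N+K)DT\bigr)}
             {O\!\bigl(\tfrac{NDKT}{p}\bigr) + O\bigl((N+K)DT\bigr)}
    \]
    % With N, D, T fixed and N >> K, the (N+K)DT term is essentially
    % constant, so the ratio, and hence the speedup, grows with K.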
K (num of clusters)   Iterations   Sequential (sec)   Parallel (sec)   Speedup
4                     71           63.81              18.04            3.54
16                    51           155.64             15.06            10.33
64                    29           338.55             10.51            32.20
128                   20           463.38             11.06            41.88
256                   16           739.15             12.30            60.11
512                   12           1105.98            13.14            84.14
1024                  10           1842.00            18.98            97.04
2048                  6            2207.78            21.19            104.17

Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size

Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size

Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for fixed data size
3. Experiment 3: For fixed size 51200*1000, vary the number of clusters with the number of iterations fixed at 30. We tested K = 4, 8, 16, 32, 64, 128, 256, and 512; the results are in Table 5 and Figures 11 and 12. As in Experiment 2, we get an outstanding speedup, and the speedup grows as K grows, as shown in Figure 11. Experiment 3 reveals another scaling property of the code: in Figure 12 we plot log2(K) against log2(running time). For the sequential case, the plot is very close to a straight line with slope 1, which coincides with the fact that as K doubles, the sequential running time doubles, the complexity being O(NDKT) + O((N+K)DT). For the parallel case, it is a curve with increasing slope that is always less than 1, with a fitted slope of 0.28. This implies that the parallelism is greater for larger K, gaining more speedup.

K (num of clusters)   Sequential (sec)   Parallel (sec)   Speedup
4                     27.82              7.72             3.60
8                     50.03              7.89             6.34
16                    94.50              8.33             11.34
32                    183.56             9.50             19.32
64                    361.92             12.31            29.40
128                   718.31             15.91            45.15
256                   1430.74            21.76            65.76
512                   2858.05            32.47            88.01

Table 5: Experiment 3. Performance when varying K, number of iterations fixed at 30

Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations

Figure 12: Experiment 3. Rate of growth of the sequential and parallel implementations
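The slope-1 observation for the sequential curve follows from the cost model; a one-line derivation (ours), assuming N >> K so that the K-linear term dominates:

    % With T = 30 fixed, t_seq(K) ~ c1*N*D*K*T + c2*N*D*T, so doubling K
    % nearly doubles the running time:
    \[
      \log_2 t_{\mathrm{seq}}(2K) - \log_2 t_{\mathrm{seq}}(K)
      \approx \log_2 \frac{2\,c_1 NDKT}{c_1 NDKT} = 1 .
    \]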
3.1 Profiling

nvprof [2] presents an overview of the GPU kernels and memory copies in our program. The summary groups all calls to the same kernel together, presenting the total time and the percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that provide a complete list of all kernel launches and memory copies and, in the case of API-Trace mode, all CUDA API calls. We perform 4 mallocs: one for the input 2D data, one for the 2D cluster data, one for the 1D membership array, and one for the 1D membershipChanged array. In every iteration, we perform two Device-to-Host copies (membership and membershipChanged) and one Host-to-Device copy (the new cluster centers).

==180922== NVPROF is profiling process 180922, command: ./test_driver
==180922== Profiling application: ./test_driver
N = 51200, dimension = 1000, k = 128, threshold = 0.0010
Type: Parallel, Computation timing = 11.8963 sec, Loop iterations = 21
==180922== Profiling result:
Time(%)   Time       Calls   Avg        Min        Max        Name
98.99%    4.11982s   21      196.18ms   195.58ms   197.96ms   find_nearest_cluster(int, int, int, float*, float*, int*, int*)
0.98%     40.635ms   23      1.7668ms   30.624us   39.102ms   [CUDA memcpy HtoD]
0.03%     1.2578ms   42      29.946us   28.735us   31.104us   [CUDA memcpy DtoH]
==180922== API calls:
Time(%)   Time       Calls   Avg        Min        Max        Name
93.06%    4.12058s   21      196.22ms   195.62ms   198.00ms   cudaDeviceSynchronize
5.79%     256.47ms   4       64.117ms   4.9510us   255.97ms   cudaMalloc
1.02%     45.072ms   65      693.42us   82.267us   39.230ms   cudaMemcpy
0.06%     2.6048ms   332     7.8450us   528ns      282.45us   cuDeviceGetAttribute
0.03%     1.3272ms   4       331.79us   297.63us   344.04us   cuDeviceTotalMem
0.02%     694.81us   21      33.086us   29.098us   47.856us   cudaLaunch
0.01%     505.33us   3       168.44us   7.3170us   330.65us   cudaFree
0.01%     237.02us   4       59.255us   56.222us   67.280us   cuDeviceGetName
0.00%     119.83us   147     815ns      533ns      12.646us   cudaSetupArgument
0.00%     28.884us   21      1.3750us   1.1180us   1.7890us   cudaConfigureCall
0.00%     16.784us   21      799ns      740ns      922ns      cudaGetLastError
0.00%     4.7380us   8       592ns      532ns      788ns      cuDeviceGet
0.00%     3.9500us   2       1.9750us   895ns      3.0550us   cuDeviceGetCount

4 Conclusion and Future Work

Our analysis shows that we obtain a significant speedup (45x on average) over the sequential execution of K-Means clustering. In our project, we parallelized only the computation of the nearest cluster. We optimized the calculation of the new cluster centers by adding points that changed membership to their new cluster's running sum and subtracting them from the old cluster's; compared with recalculating the cluster centers from scratch, this saved significant running time. Due to a shortage of time, we were not able to quantify this additional speedup. There is definitely scope for further speedup (when the input dimension increases) if the new cluster center calculation is also parallelized with CUDA.
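The copy pattern reported by nvprof (one HtoD copy of the centers and two DtoH copies back per iteration) corresponds to a HOST loop along the following lines. This is our sketch with our own variable names, assuming the kernel sketch from Section 2.2, not the project's exact code:

    /* Our sketch of the HOST-side loop producing the nvprof copy pattern
     * above. Assumes find_nearest_cluster_sketch from Section 2.2. */
    #include <cuda_runtime.h>

    void kmeans_device_loop(int N, int D, int K, float threshold,
                            const float *d_data,
                            float *h_centers, float *d_centers,
                            int *h_membership, int *d_membership,
                            int *h_blockDelta, int *d_blockDelta,
                            int numClusterBlocks, int threadsPerBlock)
    {
        int delta;
        do {
            /* send the centers recomputed on the HOST to the DEVICE */
            cudaMemcpy(d_centers, h_centers, (size_t)K * D * sizeof(float),
                       cudaMemcpyHostToDevice);

            find_nearest_cluster_sketch<<<numClusterBlocks, threadsPerBlock,
                                          threadsPerBlock * sizeof(int)>>>(
                N, D, K, d_data, d_centers, d_membership, d_blockDelta);
            cudaDeviceSynchronize();   /* dominates the API-call profile */

            /* fetch the new memberships and the per-block change counts */
            cudaMemcpy(h_membership, d_membership, (size_t)N * sizeof(int),
                       cudaMemcpyDeviceToHost);
            cudaMemcpy(h_blockDelta, d_blockDelta,
                       (size_t)numClusterBlocks * sizeof(int),
                       cudaMemcpyDeviceToHost);

            delta = 0;                 /* total membership changes */
            for (int b = 0; b < numClusterBlocks; b++)
                delta += h_blockDelta[b];

            /* ... incremental center update on the HOST, as in Sec. 1.2 ... */
        } while ((float)delta / N > threshold);
    }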
References

[1] Wei-keng Liao. K-Means clustering code: http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html, Northwestern University, 2005.
[2] nvprof, CUDA Profiler User's Guide: http://docs.nvidia.com/cuda/profiler-users-guide/#axzz4SHzfjCkf
[3] CUDA Occupancy Calculator: https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/
[4] Understanding CUDA: https://courses.engr.illinois.edu/ece498al/Syllabus.html