Saraswati Mishra. Int. Journal of Engineering Research and Application, www.ijera.com
ISSN: 2248-9622, Vol. 6, Issue 8 (Part 5), August 2016, pp. 67-71
An Efficient Method of Partitioning High Volumes of
Multidimensional Data for Parallel Clustering Algorithms
Saraswati Mishra (1), Avnish Chandra Suman (2)
(1) Centre for Development of Telematics, New Delhi - 110030, India. saraswatimishra18@gmail.com
(2) Centre for Development of Telematics, New Delhi - 110030, India. avnishchandrasuman@gmail.com
ABSTRACT
An optimal data partitioning in parallel/distributed implementations of clustering algorithms is a necessary computation, as it ensures independent task completion, fair distribution, fewer affected points, and better and faster merging. Though partitioning using a Kd-Tree is conventionally used in academia, it suffers from performance degradation and bias (unequal distribution) as the dimensionality of the data increases, and hence is not suitable for practical use in industry, where dimensionality can be on the order of hundreds to thousands. To address these issues we propose two new partitioning techniques using existing mathematical models and study their feasibility, performance (bias and partitioning speed), and possible variants in choosing initial seeds. The first method uses an n-dimensional hashed grid based approach which maps the points in space to a set of cubes that hash the points. The second method uses a tree of Voronoi planes where each plane corresponds to a partition. We found that the grid based approach was computationally impractical, while a tree of Voronoi planes (using scalable K-Means++ initial seeds) drastically outperformed the Kd-Tree method as dimensionality increased.
Keywords: Clustering, Data Partitioning, Parallel Processing, Kd-Tree, Voronoi Diagrams
I. INTRODUCTION
While profiling a hybrid (parallel and distributed) implementation of the OPTICS algorithm (Goyal et al., 2015), we observed that over 50% of our threads were outperforming the rest by huge margins. Our method was to 1. partition the existing data into x parts, 2. treat each part as a separate input and run a parallel OPTICS thread on each, and 3. merge the resultant clusters. We used a Kd-tree (Bentley, 1975) to make the partitions. A k-d tree is a variant of the binary tree in which each node represents a data point in n-dimensional space. Every non-leaf node of a k-d tree represents a splitting (n-1)-dimensional hyperplane resulting in two half-spaces, which can be thought of as partitions. The following method was used:
1. Start with the dimension having the highest variance and divide the data set based on point values in that dimension only. The result is two partitions.
2. Repeat the process with the dimension of next highest variance until the desired number of partitions is made.
A sample representation looks something like this:
Figure 1: Kd-Tree Partitioning
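For concreteness, the following is a minimal C sketch of one such variance-based split step. The function name kd_split, the flat point layout, and the sort-based median are illustrative assumptions for exposition, not the exact code we profiled.

#include <stdlib.h>

static int cmp_dim;           /* dimension compared by cmp() */
static const float *cmp_pts;  /* flat array: point i starts at i*d */
static int cmp_d;

static int cmp(const void *a, const void *b) {
    float x = cmp_pts[*(const int *)a * cmp_d + cmp_dim];
    float y = cmp_pts[*(const int *)b * cmp_d + cmp_dim];
    return (x > y) - (x < y);
}

/* Reorders idx[0..n-1] by the highest-variance dimension and returns
 * the median position; indices before it form one partition, the rest
 * the other. */
int kd_split(const float *pts, int *idx, int n, int d) {
    int best = 0;
    float best_var = -1.0f;
    for (int j = 0; j < d; j++) {      /* dimension of highest variance */
        float mean = 0, var = 0;
        for (int i = 0; i < n; i++) mean += pts[idx[i] * d + j];
        mean /= n;
        for (int i = 0; i < n; i++) {
            float t = pts[idx[i] * d + j] - mean;
            var += t * t;
        }
        if (var > best_var) { best_var = var; best = j; }
    }
    cmp_dim = best; cmp_pts = pts; cmp_d = d;
    qsort(idx, n, sizeof(int), cmp);   /* O(n log n); quickselect would
                                          give the O(n) median of item 2 below */
    return n / 2;
}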
We observed various limitations. Assume n data points, m partitions, and d dimensions:
1. A data scan was needed at every stage, i.e., to get m partitions we needed O(m) scans over the same points again and again.
2. Median finding is a costly operation of order O(n); executed for every non-leaf node, the overall cost is of order O(mn).
3. A k-d tree doesn't guarantee that every dimension is considered. In fact, it exhibits a bias towards a few dimensions of high variance; hence, for large d, points in the same partition can still be far apart along the dimensions that were never split on. This leads to a poor clustering result. Moreover, as the number of dimensions increases, the performance of the k-d tree with regard to bias decreases.
4. Partitions are half-spaces. They have a general tendency to produce a larger number of affected points in uniformly or near-uniformly distributed space.
5. There is no inherent merging structure; a merging strategy needs to be implemented separately.
The most concerning of these for us were 1 and 2, as they formed the highly serial and slow part of our parallel implementation. We wanted to reduce the multiple passes over the data to as few as possible. We also wanted a better way of finding the median, or a near-median. In the next section we show how we tried to solve 1 and 2 by splitting our d-dimensional space into d-cubes such that, in a single pass, we determine the points in every cube, their cardinality, and other statistics that improve median-finding performance significantly. However, we soon realized the impractical aspects of this approach and its inherent lack of solutions to 3, 4, and 5. We then moved on to a second approach, in which we partition the space using n-dimensional Voronoi diagrams, and found that with proper initial seeds it dramatically outperforms the kd-tree.
II. N-CUBE BASED APPROACH
The proposed partitioning approach works by initially dividing the n-dimensional space into n-cubes, making y splits along each dimension so that we obtain (y+1)^n cubes.
The basic idea is to create an overall index (a summary of the space based on cubes) such that, for any partitioning computation, we only need this summary and not the entire data. The n-cubes work as follows.
Algorithm: Construct Cubes
Input: Set of x points (p1, p2, ..., px) in space, where each pi is (x1, x2, ..., xn);
       the number of partitions m;
       k: a natural number, k >= 1
       # higher values of k ensure a better overall distribution but lower performance
Output: Set of cubes (c1, c2, ..., cM), where M = (Y+1)^n
Procedure:
# Find an optimal Y and initialize cube boundaries
Let min_j be the minimum point value in the j-th dimension
Let max_j be the maximum point value in the j-th dimension
Let Y = k*y and M = (Y+1)^n
for each cube ci in (c1, ..., cM)
    Boundary(ci) in dimension j = { min_j + (i-1)*(max_j - min_j)/(Y+1),
                                    min_j + i*(max_j - min_j)/(Y+1) }
Sort the cubes so that they can be binary searched
for each pi in (p1, ..., px)
    find the cube c that contains pi (binary search per dimension)
    add pi to c
    c.totalPoints++
Algorithm: Find Median
Input: Set of cubes (c1, ..., cM); total point count P
Output: Median along a dimension j, say m_j
Procedure:
X = 0
while X < P/2
    move to the next cube c along dimension j; X = X + c.totalPoints
# We stop at the cube that contains the median (or a close approximation
# if k is too low or too high)
find the median of the set of points in c
We can see that cube creation is an O(x + log x) task, while median finding is an O(x/M) task (here x is the number of points), which looks very efficient. The approach should have worked in two passes over the data plus a single pass over the grid cells. However, there were major design and implementation issues with this approach:
1. The number of cubes is (y+1)^n, which is exponential in the dimensionality. Even with a single split per dimension we get 2^n cells, which for huge n grows towards the total number of data points.
2. It is not directly feasible programmatically: accessing an n-dimensional array needs n nested loops, and we do not know n beforehand. The workaround is to flatten the n-d cells into one very long 1-d array, but we were then unable to allocate memory for it on the stack (see the sketch after this list).
3. There are too many cells, most of them sparse, and the data distribution across cells is not uniform at all.
4. Most of the partitions will be empty even when the number of data points N is large, leading to an extreme waste of memory and CPU time.
5. Hence we conclude that the method is not suitable for data beyond very low dimensionality.
6. It did not solve the bias problem and in fact tends to worsen it as the number of cubes grows.
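The flattening mentioned in issue 2 can be sketched as follows; the helper name cell_index, the uniform cell widths, and the row-major layout are our own illustrative assumptions.

#include <stdlib.h>

/* Map a d-dimensional point to a single flattened cell index in a grid
 * with cells_per_dim uniform cells along every dimension. */
long cell_index(const float *p, int d, int cells_per_dim,
                const float *mins, const float *maxs) {
    long idx = 0;
    for (int j = 0; j < d; j++) {
        float width = (maxs[j] - mins[j]) / cells_per_dim;
        int c = (int)((p[j] - mins[j]) / width);
        if (c >= cells_per_dim) c = cells_per_dim - 1;  /* clamp max edge */
        idx = idx * cells_per_dim + c;                  /* row-major flattening */
    }
    /* The grid has pow(cells_per_dim, d) cells, so idx overflows long
     * and the backing array overflows memory for the large d we target,
     * which is issues 1 and 4 in practice. */
    return idx;
}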
III. VORONOI DIAGRAM BASED
PARTITIONING
Our second approach uses Voronoi diagrams (Aurenhammer, 1991). Our goal is to construct a partitioning scheme that handles high dimensionality well, rather than one that provides good performance only by ignoring most dimensions. It is also necessary that partitions do not hold too many or too few points. The number of passes over the data should be as small as possible, preferably close to one.
An efficient way to address problems 1 and 2 of the kd-tree is a tree structure that needs O(log n) comparisons on average to distribute a point and to determine the affected partitions, where n is the number of nodes in the tree. One way to consider all dimensions together is to use a Voronoi diagram, which partitions the space into Voronoi cells directed by a set of split points Q = {q1, q2, ...}, such that for the cell corresponding to split point qi, the points x in that cell are nearer to qi than to any other split point in Q. Hence, by constructing a tree of Voronoi diagrams, we can satisfy both of our major concerns.
Let us call such a structure a v_tree. The top or root node of a v_tree gives a brief summary of the whole data and is split into many Voronoi cells, which are split in turn, and so on.
Figure 2: A voronoi diagram depicted in two
dimensions
Construction of a v_tree
At each level, find k (here, k = 2) centres:
1. The centres must be appropriately spaced and far apart.
2. Use scalable K-Means++ (Bahmani et al., 2012) or the GNAT technique (Brin, 1995) to choose them.
3. Assign every point to the nearer centre based on distance.
4. All points that lie within eps of the cell boundary (Goyal et al., 2015) are considered affected.
5. Delete the extra points stored in the parent node to remove redundancy.
Repeat for the new sets until the requisite number of partitions is found; a minimal sketch of one such split follows.
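The sketch below assigns each point to the nearer of two centres by Euclidean distance and marks a point as affected when its two centre distances differ by at most 2*eps, a simple proxy for lying within eps of the bisecting boundary. The function names and this proxy test are our own assumptions rather than the exact implementation.

#include <math.h>

/* Squared Euclidean distance between two d-dimensional points. */
static float dist2(const float *a, const float *b, int d) {
    float s = 0;
    for (int j = 0; j < d; j++) {
        float t = a[j] - b[j];
        s += t * t;
    }
    return s;
}

/* One v_tree split: side[i] becomes 0 or 1 for the nearer centre;
 * affected[i] is set when the point lies near the bisecting boundary. */
void v_split(const float *pts, int n, int d,
             const float *c1, const float *c2,
             float eps, int *side, int *affected) {
    for (int i = 0; i < n; i++) {
        float d1 = sqrtf(dist2(pts + i * d, c1, d));
        float d2 = sqrtf(dist2(pts + i * d, c2, d));
        side[i] = (d2 < d1);
        /* |d1 - d2| <= 2*eps holds for every point within eps of the
         * hyperplane bisecting c1 and c2, so this test over-approximates
         * the affected set, which is safe for merging. */
        affected[i] = (fabsf(d1 - d2) <= 2.0f * eps);
    }
}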
Data Structure
A minimal data structure for an n-dimensional Voronoi tree implementation in C looks like this:
typedef struct member {
    int id;        /* point identifier */
    float *val;    /* coordinate vector */
} member;
struct v_node {
    int level;                          /* depth in the tree */
    int core1, core2;                   /* ids of the two centres */
    member *mem1, *mem2, *mem_overlap;  /* points of each cell and the overlap */
    struct v_node *left, *right;        /* child Voronoi cells */
    int count_mem1, count_mem2, count_overlap, total_count;
};
struct v_tree {
    int levels;
    struct v_node *head;                /* summary of the entire data */
};
The head node is a summary of the entire data, and each parent node contains a summary of the points in its child nodes. The exact points are stored only until the data is partitioned at that level and are deleted as soon as we move to the next level.
Note that a v_tree is not necessarily binary; more than two centres can be chosen.
The leaf nodes are our final partitions, and going up the tree inherently provides a merging structure for the resultant clusters. The load in each partition depends on the centres chosen and the distribution of the data in n-dimensional space.
How to choose initial seeds?
1. Select randomly: Choosing centres randomly guarantees nothing and simply leaves the outcome to the centres chosen and the distribution of the data. Every load distribution is equally probable, so the probability of getting a perfect load balance tends to zero.
2. GNAT Approach: Suppose we need n seeds. We start by selecting a point at random. The next point is chosen such that its distance from the first point is maximal; the third point is chosen such that the sum of its distances from the first two points is maximal. Similarly, the fourth point should be the point farthest (in summed distance) from the first three points, and so on. This approach is good in terms of load balancing for a large value of n, but it cannot guarantee good balance for small n (~2-5).
3. Scalable K-Means++ Approach: Suppose k centres are needed and let C be the set of initial seeds. Then:
a. Sample a point uniformly at random from the data points and add it to C.
b. For each data point p, compute its distance D(p) from the nearest centre in C.
c. Choose a new point p using a weighted probability distribution in which a point is chosen with probability proportional to D(p)^2, and take it as a new centre.
d. If you have k centres, proceed with partitioning; otherwise repeat steps b and c. (A sketch of this seeding appears after this list.)
The centres we get tend to be close to the centroids of the clusters present in the data, assuming there are k clusters. The load distribution is entirely data dependent; however, we can be optimistic about not getting a very biased distribution on real-life datasets.
Such a partitioning strongly tends to reduce the time spent merging the final clusters, as there will be minimal affected points.
4. More than two centres: these can be used in an implementation to satisfy various criteria.
a. All the above approaches (or any random or probability based approach) tend towards good balancing as the number of centres increases; this can be proved by induction. We get a perfect balance when the number of points equals the number of centres, i.e., each point is its own partition.
b. They cover the case when the number of partitions is not a power of 2.
c. A different number of centres at each level can be used for fully guided partitioning.
5. What about the median? Points very close to the median and on the same axis as the centres will produce exact partitions (similar to a kd-tree); however, the complexity sums to O(kd-tree) + O(v_tree).
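As a sketch of approach 3, the following C code performs one round of D(p)^2-weighted sampling, the core step of K-Means++ seeding; the scalable variant of Bahmani et al. (2012) samples several candidates per round and then reclusters them, which we omit here. The function names and flat array layout are our own assumptions.

#include <stdlib.h>

/* Squared Euclidean distance (same helper as in the split sketch). */
static float dist2(const float *a, const float *b, int d) {
    float s = 0;
    for (int j = 0; j < d; j++) {
        float t = a[j] - b[j];
        s += t * t;
    }
    return s;
}

/* Pick the index of one new centre with probability proportional to
 * D(p)^2, the squared distance of p from its nearest existing centre.
 * pts is n points of d floats; centres is k points of d floats. */
int sample_next_centre(const float *pts, int n, int d,
                       const float *centres, int k) {
    float total = 0;
    float *w = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        float best = dist2(pts + i * d, centres, d);
        for (int c = 1; c < k; c++) {
            float dc = dist2(pts + i * d, centres + c * d, d);
            if (dc < best) best = dc;
        }
        w[i] = best;                 /* already squared, i.e. D(p)^2 */
        total += best;
    }
    /* Draw r uniformly in [0, total) and walk the cumulative weights. */
    float r = ((float)rand() / ((float)RAND_MAX + 1.0f)) * total;
    int chosen = n - 1;
    for (int i = 0; i < n; i++) {
        if (r < w[i]) { chosen = i; break; }
        r -= w[i];
    }
    free(w);
    return chosen;
}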
IV. EXPERIMENTATION & RESULTS
We used the following environment for testing and profiling. Execution environment: Ubuntu 13.04, Intel Core i3 (3rd generation), 2.4 GHz (2 cores, hyper-threaded), 4 GB RAM, 3 MB L2 cache. Compilation environment: C, gcc, gdb, gprof, Vampir, Geany IDE. Visualization of output was done using GeoGebra.
Test data sets (no. of points x no. of dimensions, double-precision data):
1. 100x2,
2. 700x9,
3. 1500x1024,
4. 4000x1024,
5. 40000x1024
Figure 3: Comparison of v_tree partitioning results with four partitions
Figure 4: Comparison of v_tree partitioning results with eight partitions
We noticed an increase in load bias as the number of partitions increases.
The distribution was almost uniform on uniform data. However, the Kd-tree (right) provides a better distribution on uniformly distributed data; this advantage diminishes as the number of dimensions grows.
Figure 5: Comparison of uniform data partitioning (v_tree)
Figure 6: Comparison of uniform data partitioning (k-d tree)
Performance (in units of approximately 10 seconds)

Approach                      | 700x9 | 1500x1024 | 4000x1024 | 40000x1024
Kd-Tree                       | 0.02  | 9.88      | 46.67     | 512
v_tree (Scalable K-Means++)   | 0.01  | 3.05      | 7.61      | 73.34
v_tree (Median)               | 0.02  | 9.96      | 54.43     | 542
V. CONCLUSION
We have seen that for higher dimensionality the proposed approach takes far less time than the kd-tree approach; however, the v_tree technique needs a load-balanced variant to give consistently good results.
Our future work will include better load balancing, comparison of merging time and distribution time, parallelizing the approach, and implementing a grid based alternative.
APPENDIX
Related code, data sets, and results can be requested from:
https://drive.google.com/file/d/0Bxo9wQ432jhla1dQ
NWFrbnM4bnc/view?usp=sharing
REFERENCES
[1]. Poonam Goyal, Sonal Kumari, Dhruv Kumar, Sundar Balasubramaniam, Navneet Goyal, Saiyedul Islam, and Jagat Sesh Challa. Parallelizing OPTICS for Commodity Clusters. In Proceedings of the 2015 International Conference on Distributed Computing and Networking (ICDCN '15), ACM, New York, NY, USA, Article 33, 2015.
[2]. J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
[3]. Franz Aurenhammer. Voronoi Diagrams - A Survey of a Fundamental Geometric Data Structure. ACM Computing Surveys, 23(3):345-405, 1991.
[4]. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii. Scalable K-Means++. Proceedings of the VLDB Endowment, 2012.
[5]. S. Brin. Near neighbor search in large metric spaces. In Proceedings of the International Conference on Very Large Databases (VLDB), 1995.