Searching for Centers: An Efficient Approach to the
Clustering of Large Data Sets Using P-trees
number of data items makes it unsuited to large data
Abstract sets. Many improvements have been implemented
[3,4,5], such as CLARA  and CLARANS  but
With the ever-increasing data-set sizes in most data they don't address the fundamental issue, namely that
mining applications, speed remains a central goal in the algorithm inherently depends on the combined
clustering. We present an approach that avoids both choice of cluster centers, and its complexity thereby
the time complexity of partition-based algorithms must scale essentially as the square of the number of
and the storage requirements of density-based ones, investigated sites. In this paper we analyze the
while being based on the same fundamental premises origins of this unfavorable scaling and see how it can
as standard partition- and density-based algorithms. be eliminate it at a fundamental level. Our idea is to
Our idea is motivated by taking an unconventional make the criterion for a "good" cluster center
perspective that puts three of the most popular independent of the locations of all other cluster
clustering algorithms, k-medoids, k-means, and centers. At a fundamental level this replaces the
center-defined DENCLUE into the same context. quadratic dependency on the number of investigated
We suggest an implementation of our idea that uses sites by a linear dependency.
P-trees 1 for efficient value-based data access.
We note that our proposed solution is not entirely
new but can be seen as implemented in the density-
1. Introduction based clustering algorithm DENCLUE  albeit with
a different justification. This allows us to separate
Many things change in data mining applications but representation issues from a more fundamental
one fact is reliably staying the same, namely that complexity question when discussing the concept of
next year's problems will involve larger data sets and an influence function as introduced in DENCLUE.
tougher performance requirements than this year's.
The datasets that are available to biological
applications keep growing continuously, since there
2. Taking a Fresh Look at
is commonly no reason to remove old data and Established Algorithms
experiments worldwide contribute new data .
Network applications are operating on massive Partition-based and density-based algorithms are
amounts of data, and there is no limit in sight to the commonly seen as fundamentally and technically
increase in network traffic, the increase in detail of distinct, and proposed combinations work on an
information that should be kept and evaluated, and applied rather than a fundamental level . We will
increasing demands on the speed with which data present three of the most popular techniques from
should be analyzed. The World-Wide Web both categories in a context that allows us to see their
constitutes another data mining area with common idea independently of their implementation.
continuously growing "data set" sizes. The list could This will allow us to combine elements from each of
be continued almost indefinitely. them and design an algorithm that is fast without
requiring any clustering-specific data structures.
It is therefore of ultimate importance to see where
the scaling behavior of standard algorithms may be The existing algorithms we consider in detail are the
improved without losing their benefits, so as not to k-medoids  and k-means  partitioning
make them obsolete over time. A clustering techniques and the center-defined version of
technique that has caused much research in this DENCLUE . The goal of these algorithms is to
direction is the k-medoids  algorithm. Although it group a data item with a cluster center that represents
has a simple justification and useful clustering its properties well. The clustering process has two
properties the default scaling behavior for its time parts that are strongly related for the algorithms we
complexity as being proportional to the square of the review, but will be separated for our clustering
algorithm. The first part consists in finding cluster
Ptree technology is patented to North Dakota State centers while the second specifies boundaries of the
University clusters. We first look at strategies that are used to
determine cluster centers. Since the k-medoids calculation of forces and can be ignored.
algorithm is commonly seen as producing a useful
clustering, we start by reviewing its definition.
2.1. K-Medoids Clustering as a Search
A good clustering in k-medoids is defined through
the minimum of a cost function. The most common
choice of cost function is the sum of squared
Euclidean distances between each data item and its
closest cluster center. An alternative way of looking
at this definition borrows ideas from physics: We
can look at cluster centers as particles that are
attracted to the data points. The potential that
describes the attraction for each data item is taken to Figure 1: Energy landscape (black) and potential of
be a quadratic function in the Euclidean distance as individual data items (gray) in k-medoids and k-
defined in the d-dimensional space of all attributes. means
The energy landscape surrounding a cluster center
with position X(m) is the sum of the individual Cluster boundaries are given as the points of equal
potentials of data items at locations X(i) distances to the closest cluster centers. This means
N d that for the k-means and k-medoids algorithms the
E ( X ( m ) ) = ∑∑ ( x (ji ) − x (jm ) ) 2 energy landscape depends on the cluster centers. The
i =1 j =1 difference between k-means and k-medoids lies in
where N is the number of data items that are assumed the way the system of all cluster centers is updated.
to influence the cluster center. We defer the If cluster centers change, the energy landscape will
discussion on the influence of cluster boundaries also change. The k-means algorithm moves cluster
until later. It can be seen that the potential that centers to the current mean of the data points and
arises from more than one data point will continue to thereby corresponds to a simple hill-climbing
be quadratic, since the sum of quadratic functions is algorithm for the minimization of the total energy of
again a quadratic function. We can calculate the all cluster centers. (Note that the "hill-climbing"
location of its minimum as follows: refers to a search for maxima whereas we are looking
for minima.) For the k-medoids algorithm the
E ( X ( m ) ) = −2∑ ( x (ji ) − x (jm ) ) = 0 attempt is made to explore the space of all cluster-
center locations completely. The reason why the k-
∂x (jm ) i =1
medoids algorithm is so much more robust than k-
Therefore we can see that the minimum of the means therefore can be traced to their different
potential is the mean of coordinates of the data update strategies.
points to which it is attracted.
1 N (i ) 2.2. Replacing a Many-Body Problem
x (jm ) = ∑xj
N i =1 by a Single-Body Problem
This result may surprise since it suggests that the We have now seen that the energy landscape that
potential minima and thereby the equilibrium cluster centers feel depends on the cluster
positions for the cluster centers in the k-medoids boundaries, and thereby on the location of all other
algorithm should be the mean, or rather the data item cluster centers. In the physics language the problem
closest to it, given the constraint that k-medoids that has to be solved is a many-body problem,
cluster centers must be data items. This may seem because many cluster centers are simultaneously
surprising since the k-medoids algorithm is known to involved in the minimization. Recognizing the
be significantly more robust than k-means which inherent complexity of many-body problems we
explicitly takes means as cluster centers. In order to consider ways of redefining our problem such that
understand this seemingly inconsistent result we we can look at one cluster center at a time, i.e.,
have to remember that the k-medoids and k-means replacing the many-body problem with a single-body
algorithms look not only for an equilibrium position problem. Our first idea may be to simply ignore all
of the cluster center within any one cluster, but rather but one cluster center while keeping the quadratic
for a minimum of the total energy of all cluster potential of all data points. Clearly this is not going
centers. To understand this we now have to look at to provide us with a useful energy landscape: If a
how cluster boundaries would be modeled in the cluster center feels the quadratic potential of all data
analogous physical system. Since data items don't points there will only be one minimum in the energy
attract cluster centers that are located outside of their landscape and that will be the mean of all data points
cluster, we model their potential as being quadratic - a trivial result. Let us therefore analyze what
within a cluster and continuing as a constant outside caused the non-trivial result in the k-medoids / k-
a cluster. Constant potentials are irrelevant for the
means case: Each cluster center only interacted with considered valid cluster centers because each new
data points that were close to it, namely in the same choice of one cluster center will change the cost or
cluster. A natural idea is therefore to limit the range "energy" for all others. Decoupling cluster centers
of attraction of data points independently of any immediately reduces the complexity to being linear
cluster shapes or locations. Limiting the range of an in the search space. Using a Gaussian influence
attraction corresponds to letting the potential function or potential achieves the goal of decoupling
approach a constant at large distances. We therefore cluster centers while leading to results that have been
look for a potential that is quadratic for small proven useful in the context of the density-based
distances and approaches a constant for large ones. clustering method DENCLUE.
A natural choice for such a potential is a Gaussian
function. 3.1. Searching for Equilibrium
Locations of Cluster Centers
We view the clustering process as a search for
equilibrium in an energy landscape that is given by
the sum of the Gaussian influences of all data points
( d ( X , X ( i ) )) 2
E ( X ) = −∑ e 2σ 2
where the distance d is taken to be the Euclidean
distance calculated in the d-dimensional space of all
Figure 2: Energy landscape (black) and potential of d ( X , X (i ) ) = ∑(x j − x (ji ) ) 2
individual data items (gray) for a Gaussian influence j =1
function analogous to DENCLUE It is important to note that the improvement in
efficiency by using this configuration-independent
The potential we have motivated can easily be function rather than the k-medoids cost function, is
identified with a Gaussian influence function in the unrelated to the representation of the data points.
density-based algorithm DENCLUE . Note that DENCLUE takes the density-based approach of
the constant shifts and opposite sign that distinguish representing data points in the space of their attribute
the potential that arise from our motivation from the values. This design choice does not automatically
one used in DENCLUE do not affect the follow from a configuration-independent influence
optimization problem. Similarly, we can identify the function.
energy with the (negative of the) density landscape in
DENCLUE. This observation allows us to draw In fact one could envision a very simple
immediate conclusions on the quality of cluster implementation of this idea in which starting points
centers generated by our approach: DENCLUE are chosen in a random or equidistant fashion and a
cluster centers have been shown to be as useful as k- hill-climbing method is implemented that optimizes
medoid ones . We would not expect them to be all cluster center candidates simultaneously using
identical because we are solving a slightly different one database scan in each optimization step. In this
problem, but it is not clear apriori, which definition paper we will describe a method that uses a general-
of a cluster center is going to result in a better purpose data structure, namely a P-tree that gives us
clustering quality. fast value-based access to counts. The benefits of this
implementation are that starting positions can be
chosen efficiently, and optimization can be done for
3. Idea one cluster center candidate at a time, allowing a
more informed choice of the number of candidates.
Having motivated a uniform view of partition
clustering as a search for equilibrium of cluster As a parameter for our minimization we have to
centers in an energy landscape of attracting data choose the width of the Gaussian function, σ. σ
items we now proceed to describe our own specifies the range for which the potential
algorithm. It is clear that the computational approximates a quadratic function. Two Gaussian
complexity of a problem that can be solved for each functions have to be at least 2σ apart to be separated
cluster center independently will be significantly by a minimum. That means that the smallest clusters
smaller than the complexity of minimizing a function we can get have diameter 2σ. The number of
of all cluster centers. The state space that has to be clusters that our algorithm finds will be determined
searched for a k-medoid based algorithm must scale by σ rather than being predefined as for k-medoids /
as the square of the number of sites that are k-means. For areas in which data points are more
widely spaced than 2σ each data point would be starting point. To find the next starting point we
considered an individual cluster. This is undesirable select the largest count we have ignored so far and
since widely spaced points are likely to be due to repeat the iterative process. We compare total
noise. We will exclude them and group the counts, which are commonly larger at higher levels.
corresponding data points with the nearest larger Therefore starting points are likely to be in different
cluster instead. In our algorithm this corresponds to high-level hyper cubes, i.e. well separated. This is
ignoring starting points for which the potential desirable since points that have high counts but are
minimum is not deeper than a threshold of -ξ. close are likely to belong to the same cluster. We
continue the process of deriving new starting points
until minimization consistently terminates in cluster
3.2. Defining Cluster Boundaries centers that have been discovered previously.
Our algorithm fundamentally differs from
DENCLUE in that we will not try to map out the 4. Algorithm using P-Trees
entire space. We will instead rely on estimates as to
whether our sampling of space is sufficient for We describe an implementation that uses P-trees, a
finding all or nearly all minima. That means that we data structure that has been shown to be appropriate
replace the complete mapping in DENCLUE with a for many data mining tasks [9,10,11].
search strategy. As a consequence we will get no
information on cluster boundaries. Center-defined 4.1. A Summary of P-Tree Features
DENCLUE considers data points as cluster members
if they are density attracted to the cluster center. P-trees represent a non-key attribute in the domain
This approach is consistent in the framework of space of the key attributes of a relation. One P-tree
density-based clustering but there are drawbacks to corresponds to one bit of one non-key attribute. It
the definition. For many applications it is hard to maintains pre-computed counts in a hierarchical
argue that a data point is considered as belonging to fashion. Total counts for given non-key attribute
a cluster other than the one defined by the cluster values or ranges can easily be computed by an "and"
center it is closest to. If one cluster has many operation on separate P-trees. This "and" operation
members in the data set used while clustering it will can also be used to derive counts at different levels
appear larger than its neighbors with few members. within the tree, corresponding to different values or
Placing the cluster boundary according to the ranges of the key attribute.
attraction regions would only be appropriate if we
could be sure that the distribution will remain the Two situations can be distinguished: The keys of a
same for any future data set. This is a stronger relation may either be data mining relevant
assumption than the general clustering hypothesis themselves or they may only serve to distinguish
that the cluster centers that represent data points well tuples. Examples of relations in which the keys are
will be the same. not data mining relevant are data streams where time
is the key of the relation but will often not be
We will therefore keep the k-means / k-medoids included as a dimension in data mining. Similarly,
definition of a cluster boundary that always places in spatial data mining, geographical location is
data points with the cluster center they are closest to. usually the key of the relation but is commonly not
Not only does this approach avoid the expensive step included in the mining process. In other situations
of precisely mapping out the shape of clusters. It the keys of the relation do themselves contain data
also allows us to determine cluster membership by a mining relevant information. Examples of such
simple distance calculation and without the need of relations are fact tables in data warehouses, where
referring to an extensive map. one fact, such as "sales", is given in the space of all
attributes that are expected to affect it. Similarly, in
3.3. Selecting Starting Points genomics, gene expression data is commonly
represented within the space of parameters that are
An important benefit of using proximity as the expected to influence it. In this case the P-tree based
definition for cluster membership is that we can representation is similar to density-based
choose starting points for the optimization that are representations such as the one used in DENCLUE.
already high in density and thereby reduce the One challenge of such a representation is that it has
number of optimization steps. high demands on storage. P-trees use efficient
compression in which subtrees that consist entirely
Our heuristics for finding good starting points is as of 0 values or entirely of 1 values are removed.
follows: We start by breaking up the n-dimensional Additional problems man arise if data is represented
hyper space into 2n hyper cubes by dividing each in the space of non-key attributes: information may
dimension into two equally sized sections. We select be lost because the same point in space may
the hypercube with the largest total count of data represent many tuples. For a P-tree based approach
points while keeping track of the largest counts we we only represent data in the space of the data
are ignoring. The process is iteratively repeated until mining relevant attributes if these attributes are keys.
we reach the smallest granularity that our In either case we get fast value-based access to
representation affords. The result specifies our first counts through "and" operations.
as for 2 neighboring points in each dimension
P-trees not only give us access to counts based on the (distance s) we can proceed in standard hill-climbing
values of attributes in individual tuples, they also fashion. We replace central point with the point that
contain information on counts at any level in a has the lowest energy, assuming it lowers the energy.
hierarchy of hyper cubes where each level If no point has lower energy, the step size s is
corresponds to a refinement by a factor of two in reduced. If the step size is already at its minimum
each attribute dimension. Note that this hierarchy is we consider the point a cluster center. If the cluster
not necessarily represented as the tree structure of center has already been found in a previous
the P-tree. For attributes that are not themselves minimization we ignore it. If we repeatedly
keys to a relation, the values in the hierarchy are rediscover old cluster centers we stop.
derived as needed by successive "and" operations,
starting with the highest order bit, and successively
"and"ing with lower order bits.
We have shown how the complexity of the k-
4.2. The Algorithm medoids clustering algorithm can be addressed at a
fundamental level and proposed an algorithm that
Our algorithm has two steps that are iterated for each makes use of our suggested modification. In the k-
possible cluster center. The goal of the first step is medoids algorithm a cost function is calculated in
to find a good initial starting point by looking for an which the contribution of any one cluster center
attribute combination with a high count in an area depends on every other cluster center. This
that has not previously been used. Note that in a dependency can be avoided if the influence of far-
situation where the entire key of the relation is away data items is limited in a configuration-
among the data mining relevant attributes, the count independent fashion. We suggest using a Gaussian
of an individual attribute combination can only be 0 function for this purpose and identify it with the
or 1. In such a case we stay at a higher level in the Gaussian influence in the density-based clustering
hierarchy when determining a good starting point. In algorithm DENCLUE. Our derivation allows us to
order to find a point with a high count, we start at the separate the representation issues that distinguish
highest level in the hierarchy for the combination of density-based algorithms from partition-based ones
all attributes that we consider. At every level we from the fundamental complexity issues that follow
select the highest count while keeping track of other from the definition of the minimization problem.
high counts that we ignore. This gives us a suitable We suggest an implementation that uses the most
starting point in b steps where b is the number of efficient aspects of both approaches. Using P-trees
levels in the hierarchy or the number of bits of the in our implementation allows us to further improve
respective attributes. We directly use the new efficiency.
starting point to find the nearest cluster center.
As minimization step we evaluate neighboring
points, with a distance (step size) s, in the energy
landscape. This requires the fast evaluation of a References
superposition of Gaussian functions. We intervalize  N. Goodman, S. Rozen, and L. Stein, "A glimpse
distances and then calculate counts within the at the DBMS challenges posed by the human genome
respective intervals. We use equality in the HOBit project", 1994
distance  to define intervals. The HOBit distance http://citeseer.nj.nec.com/goodman94glimpse.html.
between two integer coordinates is equal to the  L. Kaufman and P.J. Rousseeuw, "Finding
number of digits by which they have to be right Groups in Data: An Introduction to Cluster
shifted to make them identical. The number of Analysis", New York: John Wiley & Sons, 1990.
intervals that have distinct HOBit distances is equal  R. Ng and J. Han, "Efficient and effective
to the number of bits of the represented numbers. clustering method for spatial data mining", In Proc.
For more than one attribute the the HOBit distance is 1994 Int. Conf. Very Large Data Bases (VLDB'94),
defined as the maxium of the individual HOBit p. 144-155, Santiago, Chile, Sept. 1994.
distances of the attributes.  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu,
"Knowledge discovery in large spatial databases:
The range of all data items with a HOBit distance Focusing techniques for efficient class
smaller than or equal to dH corresponds to a d- identification", In Proc. 4th Int. Symp. Large Spatial
dimensional hyper cube that is part of the concept Databases (SSD'95), p 67-82, Portland, ME, Aug.
hierarchy that P-trees represent. Therefore counts 1995.
can be efficiently calculated by "and"ing P-trees.  P. Bradley, U. Fayyard, and C. Reina, "Scaling
The calculation is done in analogy to a Podium clustering algorithms to large databases", In Proc.
Nearest Neighbor classification algorithm using P- 1998 Int. Conf. Knowledge Discovery and Data
trees . Mining (KDD'98), p. 9-15, New York, Aug. 1998
 A. Hinneburg and D. A. Keim, "An efficient
Once we have calculated the weighted number of approach to clustering in large multimedia databases
points for a given location in attribute space as well with noise", In Proc. 1998 Int. Conf. Knowledge
Discovery and Data Mining (KDD'98), p. 58-65,
New York, Aug. 1998.
 M. Dash, H. Liu, X. Xu, "1+1>2: Merging
distance and density based clustering",
 J. MacQueen, "Some methods for classification
and analysis of multivariate observations", Prc. 5th
Berkeley Symp. Math. Statist. Prob., 1:281-297,
 Qin Ding, Maleq Khan, Amalendu Roy, and
William Perrizo, "P-tree Algebra", ACM Symposium
on Applied Computing (SAC'02), Madrid, Spain,
 Qin Ding, Qiang Ding, William Perrizo,
"Association Rule Mining on remotely sensed
images using P-trees", PAKDD-2002, Taipei,
 Maleq Khan, Qin Ding, William Perrizo, "K-
Nearest Neighbor classification of spatial data
streams using P-trees", PAKDD-2002, Taipei,
Taiwan, May, 2002.
 Willy Valdivia-Granda, Edward Deckard,
William Perrizo, Qin Ding, Maleq Khan, Qiang
Ding, Anne Denton, "Biological systems and data
mining for phylogenomic expression profiling",
submitted to ACM SIGKDD 2002.