Searching for Centers: An Efficient Approach to the Clustering of Large Data Sets Using P-trees

Abstract

With the ever-increasing data-set sizes in most data mining applications, speed remains a central goal in clustering. We present an approach that avoids both the time complexity of partition-based algorithms and the storage requirements of density-based ones, while being based on the same fundamental premises as standard partition- and density-based algorithms. Our idea is motivated by taking an unconventional perspective that puts three of the most popular clustering algorithms, k-medoids, k-means, and center-defined DENCLUE, into the same context. We suggest an implementation of our idea that uses P-trees¹ for efficient value-based data access.

¹ P-tree technology is patented by North Dakota State University.

1. Introduction

Many things change in data mining applications, but one fact reliably stays the same, namely that next year's problems will involve larger data sets and tougher performance requirements than this year's. The data sets that are available to biological applications keep growing continuously, since there is commonly no reason to remove old data and experiments worldwide contribute new data [1]. Network applications operate on massive amounts of data, and there is no limit in sight to the increase in network traffic, the increase in detail of information that should be kept and evaluated, and the increasing demands on the speed with which data should be analyzed. The World-Wide Web constitutes another data mining area with continuously growing "data set" sizes. The list could be continued almost indefinitely.

It is therefore of utmost importance to see where the scaling behavior of standard algorithms may be improved without losing their benefits, so as not to make them obsolete over time. A clustering technique that has prompted much research in this direction is the k-medoids [2] algorithm. Although it has a simple justification and useful clustering properties, its default time complexity, which is proportional to the square of the number of data items, makes it unsuited to large data sets. Many improvements have been implemented [3,4,5], such as CLARA [2] and CLARANS [3], but they don't address the fundamental issue, namely that the algorithm inherently depends on the combined choice of cluster centers, and its complexity thereby must scale essentially as the square of the number of investigated sites. In this paper we analyze the origins of this unfavorable scaling and see how it can be eliminated at a fundamental level. Our idea is to make the criterion for a "good" cluster center independent of the locations of all other cluster centers. At a fundamental level this replaces the quadratic dependency on the number of investigated sites by a linear dependency.

We note that our proposed solution is not entirely new but can be seen as implemented in the density-based clustering algorithm DENCLUE [6], albeit with a different justification. This allows us to separate representation issues from a more fundamental complexity question when discussing the concept of an influence function as introduced in DENCLUE.

2. Taking a Fresh Look at Established Algorithms

Partition-based and density-based algorithms are commonly seen as fundamentally and technically distinct, and proposed combinations work on an applied rather than a fundamental level [7]. We will present three of the most popular techniques from both categories in a context that allows us to see their common idea independently of their implementation. This will allow us to combine elements from each of them and design an algorithm that is fast without requiring any clustering-specific data structures.

The existing algorithms we consider in detail are the k-medoids [2] and k-means [8] partitioning techniques and the center-defined version of DENCLUE [6]. The goal of these algorithms is to group a data item with a cluster center that represents its properties well. The clustering process has two parts that are strongly related for the algorithms we review, but will be separated for our clustering algorithm. The first part consists in finding cluster centers, while the second specifies the boundaries of the clusters. We first look at strategies that are used to determine cluster centers.
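For all three algorithms the boundary part reduces to assigning each data item to its closest center. A minimal illustrative sketch (the function name and data are ours, not from the paper):

```python
def assign_to_centers(points, centers):
    """Assign each data item to the index of its closest cluster center
    (squared Euclidean distance, so no square root is needed)."""
    labels = []
    for p in points:
        d2 = [sum((pj - cj) ** 2 for pj, cj in zip(p, c)) for c in centers]
        labels.append(d2.index(min(d2)))
    return labels

points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = [(0.0, 0.0), (5.0, 5.0)]
print(assign_to_centers(points, centers))  # -> [0, 0, 1, 1]
```

Finding the centers themselves is the hard part, and it is where the algorithms below differ.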
Since the k-medoids algorithm is commonly seen as producing a useful clustering, we start by reviewing its definition.

2.1. K-Medoids Clustering as a Search for Equilibrium

A good clustering in k-medoids is defined through the minimum of a cost function. The most common choice of cost function is the sum of squared Euclidean distances between each data item and its closest cluster center. An alternative way of looking at this definition borrows ideas from physics: We can look at cluster centers as particles that are attracted to the data points. The potential that describes the attraction for each data item is taken to be a quadratic function in the Euclidean distance as defined in the d-dimensional space of all attributes. The energy landscape surrounding a cluster center with position X^{(m)} is the sum of the individual potentials of the data items at locations X^{(i)}:

E(X^{(m)}) = \sum_{i=1}^{N} \sum_{j=1}^{d} (x_j^{(i)} - x_j^{(m)})^2

where N is the number of data items that are assumed to influence the cluster center. We defer the discussion of the influence of cluster boundaries until later. It can be seen that the potential that arises from more than one data point will continue to be quadratic, since the sum of quadratic functions is again a quadratic function. We can calculate the location of its minimum as follows:

\frac{\partial}{\partial x_j^{(m)}} E(X^{(m)}) = -2 \sum_{i=1}^{N} (x_j^{(i)} - x_j^{(m)}) = 0

Therefore the minimum of the potential is the mean of the coordinates of the data points to which the center is attracted:

x_j^{(m)} = \frac{1}{N} \sum_{i=1}^{N} x_j^{(i)}

This result may surprise, since it suggests that the potential minima, and thereby the equilibrium positions for the cluster centers in the k-medoids algorithm, should be the means, or rather the data items closest to them, given the constraint that k-medoids cluster centers must be data items. This may seem surprising because the k-medoids algorithm is known to be significantly more robust than k-means, which explicitly takes means as cluster centers. In order to understand this seemingly inconsistent result we have to remember that the k-medoids and k-means algorithms look not only for an equilibrium position of the cluster center within any one cluster, but rather for a minimum of the total energy of all cluster centers. To understand this we now have to look at how cluster boundaries would be modeled in the analogous physical system. Since data items don't attract cluster centers that are located outside of their cluster, we model their potential as being quadratic within a cluster and continuing as a constant outside the cluster. Constant potentials are irrelevant for the calculation of forces and can be ignored.

Figure 1: Energy landscape (black) and potential of individual data items (gray) in k-medoids and k-means

Cluster boundaries are given as the points of equal distance to the closest cluster centers. This means that for the k-means and k-medoids algorithms the energy landscape depends on the cluster centers. The difference between k-means and k-medoids lies in the way the system of all cluster centers is updated. If cluster centers change, the energy landscape will also change. The k-means algorithm moves cluster centers to the current mean of the data points and thereby corresponds to a simple hill-climbing algorithm for the minimization of the total energy of all cluster centers. (Note that "hill climbing" refers to a search for maxima, whereas we are looking for minima.) For the k-medoids algorithm the attempt is made to explore the space of all cluster-center locations completely. The reason why the k-medoids algorithm is so much more robust than k-means can therefore be traced to their different update strategies.

2.2. Replacing a Many-Body Problem by a Single-Body Problem

We have now seen that the energy landscape that cluster centers feel depends on the cluster boundaries, and thereby on the locations of all other cluster centers. In the language of physics, the problem that has to be solved is a many-body problem, because many cluster centers are simultaneously involved in the minimization. Recognizing the inherent complexity of many-body problems, we consider ways of redefining our problem such that we can look at one cluster center at a time, i.e., replacing the many-body problem with a single-body problem. Our first idea may be to simply ignore all but one cluster center while keeping the quadratic potential of all data points. Clearly this is not going to provide us with a useful energy landscape: If a cluster center feels the quadratic potential of all data points, there will only be one minimum in the energy landscape, and that will be the mean of all data points, a trivial result. Let us therefore analyze what caused the non-trivial result in the k-medoids / k-means case.
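The Section 2.1 derivation, namely that the minimum of a sum of quadratic potentials lies at the mean of the attracting data points, can be checked numerically. The following sketch is purely illustrative (function names and data are ours, not the paper's):

```python
def energy(center, points):
    """E(X) as in Section 2.1: sum of squared Euclidean distances."""
    return sum(sum((cj - pj) ** 2 for cj, pj in zip(center, p)) for p in points)

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
mean = tuple(sum(p[j] for p in points) / len(points) for j in range(2))

print(mean)                  # -> (3.0, 2.0)
print(energy(mean, points))  # -> 16.0

# Any displacement away from the mean raises the energy by N * |shift|^2.
for shift in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    shifted = (mean[0] + shift[0], mean[1] + shift[1])
    assert energy(shifted, points) > energy(mean, points)
```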
In that case each cluster center only interacted with data points that were close to it, namely those in the same cluster. A natural idea is therefore to limit the range of attraction of data points independently of any cluster shapes or locations. Limiting the range of an attraction corresponds to letting the potential approach a constant at large distances. We therefore look for a potential that is quadratic for small distances and approaches a constant for large ones. A natural choice for such a potential is a Gaussian function.

Figure 2: Energy landscape (black) and potential of individual data items (gray) for a Gaussian influence function analogous to DENCLUE

The potential we have motivated can easily be identified with a Gaussian influence function in the density-based algorithm DENCLUE [6]. Note that the constant shifts and opposite sign that distinguish the potential arising from our motivation from the one used in DENCLUE do not affect the optimization problem. Similarly, we can identify the energy with the (negative of the) density landscape in DENCLUE. This observation allows us to draw immediate conclusions on the quality of the cluster centers generated by our approach: DENCLUE cluster centers have been shown to be as useful as k-medoid ones [6]. We would not expect them to be identical, because we are solving a slightly different problem, but it is not clear a priori which definition of a cluster center is going to result in a better clustering quality.

3. Idea

Having motivated a uniform view of partition clustering as a search for equilibrium of cluster centers in an energy landscape of attracting data items, we now proceed to describe our own algorithm. It is clear that the computational complexity of a problem that can be solved for each cluster center independently will be significantly smaller than the complexity of minimizing a function of all cluster centers. The state space that has to be searched for a k-medoid-based algorithm must scale as the square of the number of sites that are considered valid cluster centers, because each new choice of one cluster center will change the cost, or "energy", for all others. Decoupling cluster centers immediately reduces the complexity to being linear in the search space. Using a Gaussian influence function or potential achieves the goal of decoupling cluster centers while leading to results that have been proven useful in the context of the density-based clustering method DENCLUE.

3.1. Searching for Equilibrium Locations of Cluster Centers

We view the clustering process as a search for equilibrium in an energy landscape that is given by the sum of the Gaussian influences of all data points X^{(i)}:

E(X) = -\sum_{i} e^{-d(X, X^{(i)})^2 / (2\sigma^2)}

where the distance d is taken to be the Euclidean distance calculated in the d-dimensional space of all attributes:

d(X, X^{(i)}) = \sqrt{\sum_{j=1}^{d} (x_j - x_j^{(i)})^2}

It is important to note that the improvement in efficiency from using this configuration-independent function rather than the k-medoids cost function is unrelated to the representation of the data points. DENCLUE takes the density-based approach of representing data points in the space of their attribute values. This design choice does not automatically follow from a configuration-independent influence function.

In fact, one could envision a very simple implementation of this idea in which starting points are chosen in a random or equidistant fashion and a hill-climbing method is implemented that optimizes all cluster-center candidates simultaneously using one database scan in each optimization step. In this paper we describe a method that uses a general-purpose data structure, namely a P-tree, that gives us fast value-based access to counts. The benefits of this implementation are that starting positions can be chosen efficiently, and that optimization can be done for one cluster-center candidate at a time, allowing a more informed choice of the number of candidates.

As a parameter for our minimization we have to choose the width of the Gaussian function, σ. σ specifies the range for which the potential approximates a quadratic function. Two Gaussian functions have to be at least 2σ apart to be separated by a minimum. That means that the smallest clusters we can get have diameter 2σ. The number of clusters that our algorithm finds will be determined by σ rather than being predefined as for k-medoids / k-means.
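For small data sets the Gaussian energy landscape of Section 3.1 can be evaluated directly. The sketch below (illustrative only, not the P-tree-based implementation described later) shows that with σ well below half the group separation, the landscape develops a deep well near each group of points:

```python
import math

def gaussian_energy(x, points, sigma):
    """E(X) = -sum_i exp(-d(X, X_i)^2 / (2 sigma^2)) from Section 3.1."""
    return -sum(
        math.exp(-sum((xj - pj) ** 2 for xj, pj in zip(x, p)) / (2 * sigma ** 2))
        for p in points
    )

points = [(0.0,), (0.2,), (5.0,), (5.2,)]   # two tight groups on a line
sigma = 0.5                                  # groups are far more than 2*sigma apart

# Deep wells near each group, a nearly flat plateau in between:
near_group = gaussian_energy((0.1,), points, sigma)
between = gaussian_energy((2.6,), points, sigma)
assert near_group < between < 0
```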
For areas in which data points are more widely spaced than 2σ, each data point would be considered an individual cluster. This is undesirable, since widely spaced points are likely to be due to noise. We will exclude them and group the corresponding data points with the nearest larger cluster instead. In our algorithm this corresponds to ignoring starting points for which the potential minimum is not deeper than a threshold of -ξ.

3.2. Defining Cluster Boundaries

Our algorithm fundamentally differs from DENCLUE in that we will not try to map out the entire space. We will instead rely on estimates as to whether our sampling of space is sufficient for finding all or nearly all minima. That means that we replace the complete mapping in DENCLUE with a search strategy. As a consequence we will get no information on cluster boundaries. Center-defined DENCLUE considers data points as cluster members if they are density-attracted to the cluster center. This approach is consistent in the framework of density-based clustering, but there are drawbacks to the definition. For many applications it is hard to argue that a data point should be considered as belonging to a cluster other than the one defined by the cluster center it is closest to. If one cluster has many members in the data set used while clustering, it will appear larger than its neighbors with few members. Placing the cluster boundary according to the attraction regions would only be appropriate if we could be sure that the distribution will remain the same for any future data set. This is a stronger assumption than the general clustering hypothesis that the cluster centers that represent the data points well will be the same.

We will therefore keep the k-means / k-medoids definition of a cluster boundary, which always places data points with the cluster center they are closest to. Not only does this approach avoid the expensive step of precisely mapping out the shape of clusters; it also allows us to determine cluster membership by a simple distance calculation and without the need of referring to an extensive map.

3.3. Selecting Starting Points

An important benefit of using proximity as the definition for cluster membership is that we can choose starting points for the optimization that are already high in density and thereby reduce the number of optimization steps.

Our heuristic for finding good starting points is as follows: We start by breaking up the n-dimensional hyper space into 2^n hyper cubes by dividing each dimension into two equally sized sections. We select the hypercube with the largest total count of data points while keeping track of the largest counts we are ignoring. The process is iteratively repeated until we reach the smallest granularity that our representation affords. The result specifies our first starting point. To find the next starting point we select the largest count we have ignored so far and repeat the iterative process. We compare total counts, which are commonly larger at higher levels. Therefore starting points are likely to be in different high-level hyper cubes, i.e., well separated. This is desirable, since points that have high counts but are close together are likely to belong to the same cluster. We continue the process of deriving new starting points until minimization consistently terminates in cluster centers that have been discovered previously.

4. Algorithm using P-Trees

We describe an implementation that uses P-trees, a data structure that has been shown to be appropriate for many data mining tasks [9,10,11].

4.1. A Summary of P-Tree Features

P-trees represent a non-key attribute in the domain space of the key attributes of a relation. One P-tree corresponds to one bit of one non-key attribute. It maintains pre-computed counts in a hierarchical fashion. Total counts for given non-key attribute values or ranges can easily be computed by an "and" operation on separate P-trees. This "and" operation can also be used to derive counts at different levels within the tree, corresponding to different values or ranges of the key attribute.

Two situations can be distinguished: The keys of a relation may either be data mining relevant themselves, or they may only serve to distinguish tuples. Examples of relations in which the keys are not data mining relevant are data streams, where time is the key of the relation but will often not be included as a dimension in data mining. Similarly, in spatial data mining, geographical location is usually the key of the relation but is commonly not included in the mining process. In other situations the keys of the relation do themselves contain data mining relevant information. Examples of such relations are fact tables in data warehouses, where one fact, such as "sales", is given in the space of all attributes that are expected to affect it. Similarly, in genomics, gene expression data is commonly represented within the space of parameters that are expected to influence it. In this case the P-tree based representation is similar to density-based representations such as the one used in DENCLUE. One challenge of such a representation is that it places high demands on storage. P-trees use efficient compression in which subtrees that consist entirely of 0 values or entirely of 1 values are removed. Additional problems may arise if data is represented in the space of non-key attributes: information may be lost because the same point in space may represent many tuples. For a P-tree based approach we only represent data in the space of the data mining relevant attributes if these attributes are keys. In either case we get fast value-based access to counts through "and" operations.
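A minimal stand-in for these count semantics (ignoring the hierarchical compression, which is the patented part of P-trees) can be sketched with ordinary bit lists: one list per bit position of an attribute, with value counts obtained by "and"-ing the slices. The names are ours, for illustration only:

```python
def bit_slices(values, bits):
    """One list per bit position (most significant first); slices[b][i] is
    bit b of values[i]. A stand-in for uncompressed P-tree bit slices."""
    return [[(v >> (bits - 1 - b)) & 1 for v in values] for b in range(bits)]

def count_value(slices, value, bits):
    """Count tuples equal to `value` by AND-ing bit slices, complementing a
    slice wherever the corresponding bit of the query value is 0."""
    result = [1] * len(slices[0])
    for b in range(bits):
        qbit = (value >> (bits - 1 - b)) & 1
        result = [r & (s if qbit else 1 - s) for r, s in zip(result, slices[b])]
    return sum(result)

values = [5, 3, 5, 7, 5]          # one 3-bit attribute
slices = bit_slices(values, 3)
print(count_value(slices, 5, 3))  # -> 3
```

Range counts work the same way by simply "and"-ing only the high-order slices, which is the mechanism behind the hierarchical hyper-cube counts used below.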
P-trees not only give us access to counts based on the values of attributes in individual tuples; they also contain information on counts at any level in a hierarchy of hyper cubes, where each level corresponds to a refinement by a factor of two in each attribute dimension. Note that this hierarchy is not necessarily represented as the tree structure of the P-tree. For attributes that are not themselves keys to a relation, the values in the hierarchy are derived as needed by successive "and" operations, starting with the highest-order bit and successively "and"-ing with lower-order bits.

4.2. The Algorithm

Our algorithm has two steps that are iterated for each possible cluster center. The goal of the first step is to find a good initial starting point by looking for an attribute combination with a high count in an area that has not previously been used. Note that in a situation where the entire key of the relation is among the data mining relevant attributes, the count of an individual attribute combination can only be 0 or 1. In such a case we stay at a higher level in the hierarchy when determining a good starting point. In order to find a point with a high count, we start at the highest level in the hierarchy for the combination of all attributes that we consider. At every level we select the highest count while keeping track of other high counts that we ignore. This gives us a suitable starting point in b steps, where b is the number of levels in the hierarchy, or the number of bits of the respective attributes.

We directly use the new starting point to find the nearest cluster center. As a minimization step we evaluate neighboring points, at a distance (step size) s, in the energy landscape. This requires the fast evaluation of a superposition of Gaussian functions. We intervalize distances and then calculate counts within the respective intervals. We use equality in the HOBit distance [11] to define intervals. The HOBit distance between two integer coordinates is equal to the number of digits by which they have to be right-shifted to make them identical. The number of intervals that have distinct HOBit distances is equal to the number of bits of the represented numbers. For more than one attribute the HOBit distance is defined as the maximum of the individual HOBit distances of the attributes.

The range of all data items with a HOBit distance smaller than or equal to d_H corresponds to a d-dimensional hyper cube that is part of the concept hierarchy that P-trees represent. Therefore counts can be efficiently calculated by "and"-ing P-trees. The calculation is done in analogy to a Podium Nearest Neighbor classification algorithm using P-trees [12].

Once we have calculated the weighted number of points for a given location in attribute space, as well as for the 2 neighboring points in each dimension (distance s), we can proceed in standard hill-climbing fashion. We replace the central point with the point that has the lowest energy, assuming it lowers the energy. If no point has lower energy, the step size s is reduced. If the step size is already at its minimum, we consider the point a cluster center. If the cluster center has already been found in a previous minimization, we ignore it. If we repeatedly rediscover old cluster centers, we stop.

5. Conclusions

We have shown how the complexity of the k-medoids clustering algorithm can be addressed at a fundamental level and proposed an algorithm that makes use of our suggested modification. In the k-medoids algorithm a cost function is calculated in which the contribution of any one cluster center depends on every other cluster center. This dependency can be avoided if the influence of far-away data items is limited in a configuration-independent fashion. We suggest using a Gaussian function for this purpose and identify it with the Gaussian influence in the density-based clustering algorithm DENCLUE. Our derivation allows us to separate the representation issues that distinguish density-based algorithms from partition-based ones from the fundamental complexity issues that follow from the definition of the minimization problem. We suggest an implementation that uses the most efficient aspects of both approaches. Using P-trees in our implementation allows us to further improve efficiency.

References

[1] L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", New York: John Wiley & Sons, 1990.

[2] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.

[3] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pp. 58-65, New York, Aug. 1998.

[4] Qin Ding, Maleq Khan, Amalendu Roy, and William Perrizo, "P-tree Algebra", ACM Symposium on Applied Computing (SAC'02), Madrid, Spain, 2002.

[5] Qin Ding, Qiang Ding, William Perrizo, "Association Rule Mining on remotely sensed images using P-trees", PAKDD-2002, Taipei, Taiwan, 2002.

[6] Maleq Khan, Qin Ding, William Perrizo, "K-Nearest Neighbor classification of spatial data streams using P-trees", PAKDD-2002, Taipei, Taiwan, May 2002.

[7] N. Goodman, S. Rozen, and L. Stein, "A glimpse at the DBMS challenges posed by the human genome project", 1994.

[8] R. Ng and J. Han, "Efficient and effective clustering method for spatial data mining", In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), pp. 144-155, Santiago, Chile, Sept. 1994.

[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification", In Proc. 4th Int. Symp. Large Spatial Databases (SSD'95), pp. 67-82, Portland, ME, Aug. 1995.

[10] P. Bradley, U. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases", In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pp. 9-15, New York, Aug. 1998.

[11] M. Dash, H. Liu, X. Xu, "1+1>2: Merging distance and density based clustering".

[12] Willy Valdivia-Granda, Edward Deckard, William Perrizo, Qin Ding, Maleq Khan, Qiang Ding, Anne Denton, "Biological systems and data mining for phylogenomic expression profiling", submitted to ACM SIGKDD 2002.