
  1. 1. Parameter-Free Spatial Data Mining Using MDL. S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos. International Conference on Data Mining 2005
  2. 2. Problems: <ul><li>Finding patterns of spatial correlation and feature co-occurrence. </li></ul><ul><ul><li>Automatically </li></ul></ul><ul><ul><ul><li>That is, parameter-free. </li></ul></ul></ul><ul><ul><li>Simultaneously </li></ul></ul><ul><li>For example: </li></ul><ul><ul><li>Spatial locations on a grid. </li></ul></ul><ul><ul><li>Features correspond to species present in specific cells. </li></ul></ul><ul><ul><li>Each (cell, species) pair is 1 if the species is present in that cell and 0 otherwise. </li></ul></ul><ul><ul><li>Feature co-occurrence: </li></ul></ul><ul><ul><ul><li>Cohabitation of species. </li></ul></ul></ul><ul><ul><li>Spatial correlation: </li></ul></ul><ul><ul><ul><li>Natural habitats for species. </li></ul></ul></ul>
  3. 3. Motivation: <ul><li>Many applications </li></ul><ul><ul><li>Biodiversity Data </li></ul></ul><ul><ul><ul><li>As we just demonstrated. </li></ul></ul></ul><ul><ul><li>Geographical Data </li></ul></ul><ul><ul><ul><li>Presence of facilities on city blocks. </li></ul></ul></ul><ul><ul><li>Environmental Data </li></ul></ul><ul><ul><ul><li>Occurrence of events (storms, drought, fire, etc.) in various locations. </li></ul></ul></ul><ul><ul><li>Historical and Linguistic Data </li></ul></ul><ul><ul><ul><li>Occurrence of words in different languages/countries, historical events in a set of locations. </li></ul></ul></ul><ul><li>Existing methods either: </li></ul><ul><ul><li>Detect one pattern, but not both, or </li></ul></ul><ul><ul><li>Require user-input parameters. </li></ul></ul>
  4. 4. Background <ul><li>Minimum Description Length (MDL): </li></ul><ul><ul><li>Let L(D|M) denote the code length required to represent data D given (using) model M. Let L(M) be the code length required to describe the model itself. </li></ul></ul><ul><ul><li>The total code length is then: </li></ul></ul><ul><ul><ul><li>L(D, M) = L(D|M) + L(M) </li></ul></ul></ul><ul><ul><li>This was used in SLIQ and is the intuitive notion behind the connection between data mining and data compression. </li></ul></ul><ul><ul><ul><li>The best model minimizes L(D, M), resulting in optimal compression. </li></ul></ul></ul><ul><ul><ul><li>Choosing the best model is a problem in its own right. </li></ul></ul></ul><ul><ul><ul><li>This will be explored further in the next paper I present. </li></ul></ul></ul>
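The two-part score L(D, M) = L(D|M) + L(M) can be illustrated with a toy setting that is not from the slides: the data is a binary sequence, and the candidate models are "one Bernoulli parameter" versus "two segments, each with its own parameter". The flat `param_bits` cost per parameter and the `log2(n)` bits for the split position are assumptions made only for this sketch.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def data_cost(bits):
    """Approximate L(D|M): n * H(p_hat) for a Bernoulli model fitted to bits."""
    if not bits:
        return 0.0
    return len(bits) * entropy_bits(sum(bits) / len(bits))

def one_segment_cost(bits, param_bits=8):
    """L(D, M) for a single Bernoulli model; L(M) is an assumed flat cost."""
    return data_cost(bits) + param_bits

def two_segment_cost(bits, split, param_bits=8):
    """L(D, M) for two Bernoulli segments split at index `split`.

    L(M) = two parameters plus log2(n) bits to state the split position
    (both costs are illustrative assumptions).
    """
    model_cost = 2 * param_bits + math.log2(len(bits))
    return data_cost(bits[:split]) + data_cost(bits[split:]) + model_cost

clustered = [0] * 8 + [1] * 8   # structure that a split can exploit
alternating = [0, 1] * 8        # no structure that a split can exploit
print(one_segment_cost(clustered), two_segment_cost(clustered, 8))
print(one_segment_cost(alternating), two_segment_cost(alternating, 8))
```

On the clustered sequence the richer model pays for itself; on the alternating one the extra model cost is wasted, so the simpler model wins. This is the trade-off MDL uses in place of a user-supplied parameter.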
  5. 5. Background <ul><li>Quadtree Compression </li></ul><ul><ul><li>Quadtrees: </li></ul></ul><ul><ul><ul><li>Used to index and reason about contiguous variable size grid regions (among other applications, mostly spatial). </li></ul></ul></ul><ul><ul><ul><ul><li>Used for 2D data; kD analogue is a kD-tree. </li></ul></ul></ul></ul><ul><ul><ul><li>“Full quadtree”: All nodes have either 0 or 4 children. </li></ul></ul></ul><ul><ul><ul><ul><li>Thus, all internal nodes correspond to a partitioning of a rectangular region into 4 subregions. </li></ul></ul></ul></ul><ul><ul><ul><li>Each quadtree’s structure corresponds to a unique partitioning. </li></ul></ul></ul><ul><ul><li>Transmission: </li></ul></ul><ul><ul><ul><li>If we only care about the structure (spatial partitioning), we can transmit a 0 for internal nodes and a 1 for leaves in depth-first order. </li></ul></ul></ul><ul><ul><ul><li>If we transmit the values as well, the cost is the number of leaves times the entropy of the leaf value distribution. </li></ul></ul></ul>
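The depth-first structure code described above can be sketched as follows. The tree representation is an assumption for illustration only: an internal node is a Python list of exactly four children, and anything else is a leaf value.

```python
def structure_bits(node):
    """Depth-first structure code: '0' for an internal node, '1' for a leaf."""
    if isinstance(node, list):
        if len(node) != 4:
            raise ValueError("a full quadtree node has 0 or 4 children")
        return "0" + "".join(structure_bits(child) for child in node)
    return "1"

def decode_structure(bits):
    """Rebuild the tree shape from its structure code (leaves become None)."""
    def parse(i):
        if bits[i] == "1":
            return None, i + 1
        children = []
        i += 1
        for _ in range(4):
            child, i = parse(i)
            children.append(child)
        return children, i

    tree, end = parse(0)
    if end != len(bits):
        raise ValueError("trailing bits after a complete tree")
    return tree

# Root split into four quadrants; the second quadrant is split once more.
tree = [1, [1, 2, 2, 1], 2, 2]
code = structure_bits(tree)
print(code, len(code))            # one bit per node
print(decode_structure(code))     # same shape, values stripped
```

Because every internal node has exactly four children, the bit string is self-delimiting: the decoder always knows how many more subtrees to read, so no lengths or separators need to be sent.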
  6. 6. Example
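  7. 7. Quadtree Encoding <ul><li>Let T be a quadtree with m′ internal nodes and m leaf nodes, of which m<sub>p</sub> have value p, for p = 1, …, k. </li></ul><ul><ul><li>The total codelength is the structure cost (one bit per node) plus the value cost (the number of leaves times the entropy of the leaf values): L(T) = m′ + m + m · H(m<sub>1</sub>/m, …, m<sub>k</sub>/m). </li></ul></ul><ul><ul><li>If we know the distribution of the leaf values, we can calculate this in constant time. </li></ul></ul><ul><ul><li>Updating the tree requires O(log n) time in the worst case, as part of the tree may require pruning. </li></ul></ul>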
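A minimal sketch of the quadtree codelength, assuming the total is the one-bit-per-node structure cost plus the number of leaves times the entropy of the leaf values, as the transmission scheme on the Background slide describes. The nested-list tree representation (a list of four children for an internal node, any other value for a leaf) is a hypothetical choice for this example.

```python
import math
from collections import Counter

def leaf_values(node):
    """Yield the value stored at every leaf, in depth-first order."""
    if isinstance(node, list):
        for child in node:
            yield from leaf_values(child)
    else:
        yield node

def count_internal(node):
    """Number of internal (four-child) nodes in the tree."""
    if isinstance(node, list):
        return 1 + sum(count_internal(child) for child in node)
    return 0

def quadtree_codelength(tree):
    """Structure bits (one per node) plus m * H(leaf value distribution)."""
    counts = Counter(leaf_values(tree))
    m = sum(counts.values())
    entropy = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return count_internal(tree) + m + m * entropy

uniform = [1, 1, 1, 1]               # one value everywhere: entropy term is 0
mixed = [1, [1, 2, 2, 1], 2, 2]      # 2 internal nodes, 7 leaves, two values
print(quadtree_codelength(uniform))
print(quadtree_codelength(mixed))
```

This version recounts the leaves on every call for clarity. The constant-time evaluation mentioned on the slide would instead keep the per-value leaf counts and the node counts up to date as the tree changes.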
  8. 8. Binary Matrices / Bi-groupings: <ul><li>Bi-grouping: </li></ul><ul><ul><li>Simultaneous grouping of m rows and n columns into k and l disjoint row and column groups. </li></ul></ul><ul><ul><li>Let D denote an m x n binary matrix. </li></ul></ul><ul><ul><li>The cost of transmitting D is given as follows: </li></ul></ul><ul><ul><ul><li>Recall the MDL Principle: L(D, M) = L(D|M) + L(M). </li></ul></ul></ul><ul><ul><li>Let {Q<sub>x</sub>, Q<sub>y</sub>} be a bi-grouping. </li></ul></ul><ul><ul><li>Lemma (we will skip the proof): </li></ul></ul><ul><ul><ul><li>The codelength for transmitting an m-to-k mapping Q<sub>x</sub> where m<sub>p</sub> symbols are mapped to the value p is approximately: </li></ul></ul></ul>
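To make the data term L(D|M) concrete for a bi-grouping, the sketch below encodes each (row group, column group) block with its own density of ones, so the block costs its size times the binary entropy of that density. This per-block entropy cost is an assumption of the sketch (the slides do not spell out the data cost), and the names `bigrouping_data_cost`, `row_groups`, and `col_groups` are hypothetical.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bigrouping_data_cost(D, row_groups, col_groups, k, l):
    """Sum over blocks of (block size) * H(block density).

    D is a list of 0/1 rows; row_groups[i] in range(k) and
    col_groups[j] in range(l) give the group of row i and column j.
    """
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            p, q = row_groups[i], col_groups[j]
            ones[p][q] += value
            size[p][q] += 1
    total = 0.0
    for p in range(k):
        for q in range(l):
            if size[p][q]:
                total += size[p][q] * binary_entropy(ones[p][q] / size[p][q])
    return total

D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A grouping that matches the block structure makes every block homogeneous.
print(bigrouping_data_cost(D, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2))
# A single row group and column group sees a half-full matrix.
print(bigrouping_data_cost(D, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1))
```

The better grouping lowers the data cost, but it also requires transmitting the two mappings, which is the role of the model term in the total.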
  9. 9. Methodology <ul><li>Exploiting spatial locality: </li></ul><ul><ul><li>Bi-grouping as presented is nonspatial! </li></ul></ul><ul><ul><li>To make it spatial, assign a non-uniform prior to possible groupings. </li></ul></ul><ul><ul><ul><li>That is, adjacent cells are more likely to belong to the same group. </li></ul></ul></ul><ul><li>Row groups correspond to spatial groupings. </li></ul><ul><ul><li>“Neighborhoods” </li></ul></ul><ul><ul><li>“Habitats” </li></ul></ul><ul><ul><li>Row groupings should demonstrate spatial coherence. </li></ul></ul><ul><li>Column groups correspond to “families”. </li></ul><ul><ul><li>“Mountain birds” </li></ul></ul><ul><ul><li>“Sea birds” </li></ul></ul><ul><li>Intuition </li></ul><ul><ul><li>Alternately group rows and columns iteratively until the total cost L(D) stops decreasing. </li></ul></ul><ul><ul><li>Finding the global optimum is very expensive. </li></ul></ul><ul><ul><ul><li>So our approach will use a greedy search for local optima. </li></ul></ul></ul>
  10. 10. Algorithms <ul><li>INNER: </li></ul><ul><ul><li>Group given the number of row and column groups. </li></ul></ul><ul><ul><li>Start with an arbitrary bi-grouping of matrix D into k row groups and l column groups. </li></ul></ul><ul><ul><li>do { </li></ul></ul><ul><ul><li>Let {Q<sub>x</sub>, Q<sub>y</sub>} be the bi-grouping at step t. </li></ul></ul><ul><ul><li>for each row i from 1 to m, assign it to the row group </li></ul></ul><ul><ul><li>1 ≤ p ≤ k such that the “cost gain” (the decrease in total cost L(D) from moving row i to group p) </li></ul></ul><ul><ul><li>is maximized. </li></ul></ul><ul><ul><li>Repeat for columns, producing the next bi-grouping. </li></ul></ul><ul><ul><li>t += 2 </li></ul></ul><ul><ul><li>} while (L(D) is decreasing) </li></ul></ul>
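The alternating loop of INNER can be sketched as below, under simplifying assumptions that should be stated plainly: the cost is only a per-block entropy data cost (no model cost and no spatial quadtree prior), and each row moves to the group whose current block densities encode it in the fewest bits. All function names here are hypothetical, and this is an illustration of the loop shape, not the authors' implementation.

```python
import math

EPS = 1e-9  # keeps log2 finite when a block is all zeros or all ones

def _h(p):
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def _block_stats(D, rg, cg, k, l):
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for i, row in enumerate(D):
        for j, v in enumerate(row):
            ones[rg[i]][cg[j]] += v
            size[rg[i]][cg[j]] += 1
    return ones, size

def data_cost(D, rg, cg, k, l):
    """Simplified total cost: sum of block size * H(block density)."""
    ones, size = _block_stats(D, rg, cg, k, l)
    return sum(size[p][q] * _h(ones[p][q] / size[p][q])
               for p in range(k) for q in range(l) if size[p][q])

def _reassign_rows(D, rg, cg, k, l):
    """Move every row to the group that encodes it most cheaply."""
    ones, size = _block_stats(D, rg, cg, k, l)
    dens = [[min(max(ones[p][q] / size[p][q], EPS), 1 - EPS) if size[p][q] else 0.5
             for q in range(l)] for p in range(k)]
    new_rg = []
    for row in D:
        r1, rn = [0] * l, [0] * l          # ones and cells of this row per column group
        for j, v in enumerate(row):
            r1[cg[j]] += v
            rn[cg[j]] += 1
        best_p, best_bits = 0, math.inf
        for p in range(k):
            bits = sum(-r1[q] * math.log2(dens[p][q])
                       - (rn[q] - r1[q]) * math.log2(1 - dens[p][q]) for q in range(l))
            if bits < best_bits:
                best_p, best_bits = p, bits
        new_rg.append(best_p)
    return new_rg

def inner(D, rg, cg, k, l):
    """Alternate row and column reassignment until the cost stops decreasing."""
    best = data_cost(D, rg, cg, k, l)
    Dt = [list(col) for col in zip(*D)]    # transpose, so columns can reuse the row step
    while True:
        new_rg = _reassign_rows(D, rg, cg, k, l)
        new_cg = _reassign_rows(Dt, cg, new_rg, l, k)
        new_cost = data_cost(D, new_rg, new_cg, k, l)
        if new_cost >= best:
            return rg, cg, best
        rg, cg, best = new_rg, new_cg, new_cost

D = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(inner(D, [0, 1, 1, 1], [0, 0, 1, 1], 2, 2))
```

Returning the last accepted grouping when the cost fails to improve mirrors the `while (L(D) is decreasing)` condition. Like the greedy search described on the Methodology slide, the loop can stop in a local optimum that depends on the starting grouping.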
  11. 11. Algorithms <ul><li>OUTER: </li></ul><ul><ul><li>Finds the number of row and column groups. </li></ul></ul><ul><ul><li>Start with k<sub>0</sub> = l<sub>0</sub> = 1. </li></ul></ul><ul><ul><li>Split the row group p* with the maximum per-row entropy, holding the columns fixed. </li></ul></ul><ul><ul><li>Move each row in p* to a new group k<sub>t</sub> + 1 iff doing so would decrease the per-row entropy of p*. </li></ul></ul><ul><ul><li>Run INNER starting from the resulting grouping. </li></ul></ul><ul><ul><li>If the cost does not decrease, return the previous grouping. </li></ul></ul><ul><ul><li>Otherwise, increment t and repeat. </li></ul></ul><ul><ul><li>Finally, perform this again for the columns. </li></ul></ul>
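The row-split step of OUTER can be sketched as follows. "Per-row entropy" is read here as the bits needed to encode a row group (one density per column group) divided by its number of rows. That reading, the zero-based group indices, and the names `per_row_entropy` and `split_row_group` are assumptions made for illustration.

```python
import math

def _h(p):
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def per_row_entropy(D, rows, cg, l):
    """Bits per row to encode `rows`, using one density per column group."""
    if not rows:
        return 0.0
    ones, size = [0] * l, [0] * l
    for i in rows:
        for j, v in enumerate(D[i]):
            ones[cg[j]] += v
            size[cg[j]] += 1
    bits = sum(size[q] * _h(ones[q] / size[q]) for q in range(l) if size[q])
    return bits / len(rows)

def split_row_group(D, rg, cg, k, l):
    """Split the row group with the highest per-row entropy.

    Rows of that group move to a new group (index k) whenever removing
    them lowers the per-row entropy of what remains. Returns the new
    row grouping and the new number of row groups.
    """
    members = {p: [i for i, g in enumerate(rg) if g == p] for p in range(k)}
    p_star = max(range(k), key=lambda p: per_row_entropy(D, members[p], cg, l))
    new_rg = list(rg)
    remaining = list(members[p_star])
    for i in members[p_star]:
        without = [r for r in remaining if r != i]
        if without and per_row_entropy(D, without, cg, l) < per_row_entropy(D, remaining, cg, l):
            new_rg[i] = k
            remaining = without
    return new_rg, k + 1

D = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(split_row_group(D, [0, 0, 0, 0], [0, 0, 1, 1], 1, 2))
```

In the full procedure this split would be followed by a run of INNER, with the larger number of groups kept only if the total cost decreased.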
  12. 12. Complexity <ul><li>INNER is linear in the number of nonzero elements in D. </li></ul><ul><ul><li>Let nnz denote the number of those elements. </li></ul></ul><ul><ul><li>Let k be the number of row groups and l be the number of column groups. </li></ul></ul><ul><ul><li>Row swaps are performed in the quadtree and take O(log m) time each, where m is the number of cells. </li></ul></ul><ul><ul><li>Let T be the number of iterations required to minimize the cost. </li></ul></ul><ul><ul><li>O(nnz * (k + l + log m) * T) </li></ul></ul><ul><li>OUTER, though quadratic with respect to (k + l), is linear with respect to the dominating term nnz. </li></ul><ul><ul><li>Let n be the number of row splits. </li></ul></ul><ul><ul><li>O((k + l)<sup>2</sup> nnz + (k + l) n log m) </li></ul></ul>
  13. 13. Experiments <ul><li>NoisyRegions </li></ul><ul><ul><li>Three features (“species”) on a 32x32 grid. </li></ul></ul><ul><ul><ul><li>So D has 32x32 = 1024 rows. </li></ul></ul></ul><ul><ul><ul><li>And 3 columns. </li></ul></ul></ul><ul><ul><li>3% of the cells, chosen at random, have a wrong species, also randomly chosen. </li></ul></ul><ul><ul><li>The spatial and non-spatial groupings are shown to the right. </li></ul></ul><ul><ul><ul><li>Recall: Bi-grouping is not spatial by default. </li></ul></ul></ul><ul><ul><li>Spatial grouping reduces the total codelength. </li></ul></ul><ul><ul><li>The result is not quite perfect due to the heuristic nature of the algorithm. </li></ul></ul>
  14. 14. Experiments <ul><li>Birds </li></ul><ul><ul><li>219 Finnish bird species over 3813 habitats of 10x10 km each. </li></ul></ul><ul><ul><li>Species are the features, habitats are cells. </li></ul></ul><ul><ul><ul><li>So our matrix is 3813x219. </li></ul></ul></ul><ul><ul><li>The spatial grouping is clearly more coherent. </li></ul></ul><ul><ul><li>Spatial grouping reveals Boreal zones: </li></ul></ul><ul><ul><ul><li>South Boreal: Light Blue and Green. </li></ul></ul></ul><ul><ul><ul><li>Mid Boreal: Yellow. </li></ul></ul></ul><ul><ul><ul><li>North Boreal: Red. </li></ul></ul></ul><ul><ul><li>Outliers are (correctly) grouped alone. </li></ul></ul><ul><ul><ul><li>Species with specialized habitats. </li></ul></ul></ul><ul><ul><ul><li>Or those reintroduced into the wild. </li></ul></ul></ul>
  15. 15. Other approaches <ul><li>Clustering </li></ul><ul><ul><li>k-means </li></ul></ul><ul><ul><ul><li>Variants using different estimates of central tendency: </li></ul></ul></ul><ul><ul><ul><ul><li>k-medoids, k-harmonic means, spherical k-means, … </li></ul></ul></ul></ul><ul><ul><ul><li>Variants determining k based on some criterion: </li></ul></ul></ul><ul><ul><ul><ul><li>X-means, G-means, … </li></ul></ul></ul></ul><ul><ul><li>BIRCH </li></ul></ul><ul><ul><li>CURE </li></ul></ul><ul><ul><li>DENCLUE </li></ul></ul><ul><ul><li>LIMBO </li></ul></ul><ul><ul><ul><li>Also information-theoretic. </li></ul></ul></ul><ul><li>These approaches are either lossy, parametric, or not easily adaptable to spatial data. </li></ul>
  16. 16. Room for improvement: <ul><li>Complexity </li></ul><ul><ul><li>O(n * log m) cost for reevaluating the quadtree codelength. </li></ul></ul><ul><ul><ul><li>O(log m) worst-case time for each reevaluation/row swap * n swaps. </li></ul></ul></ul><ul><ul><ul><li>However, the average-case complexity is probably much better. </li></ul></ul></ul><ul><ul><ul><li>If we know something about the data distribution, we might be able to reduce this. </li></ul></ul></ul><ul><ul><li>Faster convergence </li></ul></ul><ul><ul><ul><li>Fewer iterations, reducing the scaling factor T. </li></ul></ul></ul><ul><ul><ul><li>Rather than stopping only when there is no decrease in cost, perhaps stop when we fall below a threshold? (Introduces a parameter) </li></ul></ul></ul><ul><li>Accuracy </li></ul><ul><ul><li>The search will only find local optima, leading to errors. </li></ul></ul><ul><ul><li>We can employ some approaches used in annealing or genetic algorithms to attempt to find the global optimum. </li></ul></ul><ul><ul><ul><li>Randomly restarting in the search space, for example. </li></ul></ul></ul><ul><ul><ul><li>Stochastic gradient descent – similar to what we’re already doing, actually. </li></ul></ul></ul>
  17. 17. Conclusion <ul><li>Simultaneous and automatic discovery of spatial correlation and feature co-occurrence. </li></ul><ul><li>Easy to exploit spatial locality. </li></ul><ul><li>Parameter-free. </li></ul><ul><li>Utilizes MDL: </li></ul><ul><ul><li>Minimizes the sum of the model cost and the data cost given the model. </li></ul></ul><ul><li>Efficient. </li></ul><ul><ul><li>Almost linear in the number of nonzero entries in the matrix. </li></ul></ul>
  18. 18. References <ul><li>S. Papadimitriou, A. Gionis, P. Tsaparas, R.A. Väisänen, H. Mannila, and C. Faloutsos, &quot;Parameter-Free Spatial Data Mining Using MDL&quot;, ICDM, Houston, TX, U.S.A., November 27-30, 2005. </li></ul><ul><li>M. Mehta, R. Agrawal, and J. Rissanen, &quot;SLIQ: A Fast Scalable Classifier for Data Mining&quot;, in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar. 1996. </li></ul>
  19. 19. Thanks! <ul><li>Any questions? </li></ul>