
  1. 1. CS599 Spatial & Temporal Database <ul><li>Spatial Data Mining: Progress and Challenges </li></ul><ul><li>Survey Paper </li></ul><ul><li>appeared in DMKD96 </li></ul><ul><li>by Koperski, K., Adhikary, J. and Han, J. </li></ul><ul><li>Simon Fraser University, Canada </li></ul><ul><li>presented by Chung-hao Tan </li></ul><ul><li>Nov.16.2000 </li></ul>
  2. 2. Outline <ul><li>What is data mining? </li></ul><ul><li>What is spatial data mining? </li></ul><ul><li>Generalization-based knowledge discovery. </li></ul><ul><li>Clustering-based analysis. </li></ul><ul><li>Exploring spatial association rules. </li></ul><ul><li>Mining in image databases. </li></ul><ul><li>Future directions & conclusion. </li></ul>
  3. 3. What Is Data Mining? <ul><li>A short definition: </li></ul><ul><li>“Extracting implicit knowledge from large amounts of data.” </li></ul><ul><li>The form of discovered knowledge: </li></ul><ul><ul><li>Regression and classification. </li></ul></ul><ul><ul><li>Association rules. </li></ul></ul><ul><ul><li>Clustering. </li></ul></ul><ul><li>What can be contributed by database research? </li></ul><ul><ul><li>Efficient data access methods (indexing). </li></ul></ul><ul><ul><li>Query optimizer. </li></ul></ul><ul><ul><li>Data integration. </li></ul></ul><ul><ul><li>… </li></ul></ul><ul><ul><li>=> Data Warehousing research provides a convenient platform for data mining. </li></ul></ul>
  4. 4. An Example of Data Mining Technique <ul><li>Example: </li></ul><ul><ul><li>Data: </li></ul></ul><ul><ul><li>Stock trading data (price, size, number of trades, etc.). </li></ul></ul><ul><ul><li>Query: </li></ul></ul><ul><ul><li>Given the current and past trading information, will the price go up or down in the next minute? </li></ul></ul><ul><ul><li>Method: </li></ul></ul><ul><ul><li>Bayesian CART model search (Chipman, 1997). </li></ul></ul><ul><ul><li>=> try to find a classification or regression tree to model the data. </li></ul></ul><ul><ul><li>Result: </li></ul></ul><ul><ul><li>1. Reduce the misclassification rate from 53% to 30%. </li></ul></ul><ul><ul><li>2. Identify the important classification rules. </li></ul></ul><ul><ul><li>3. Identify the important variables (predictors). </li></ul></ul>
  5. 5. An Example of Data Mining Technique (Cont.)
  6. 6. An Example of Data Mining Technique (Cont.)
  7. 7. What Is Spatial Data Mining? <ul><li>A short definition: </li></ul><ul><ul><li>Extraction of implicit knowledge, spatial relations, or other patterns not explicitly stored in spatial databases. </li></ul></ul><ul><li>Benefits: </li></ul><ul><ul><li>Understand spatial data; query optimization. </li></ul></ul><ul><ul><li>Discover relationships between spatial data and non-spatial data. </li></ul></ul><ul><ul><li>Construction of spatial knowledge bases (e.g. associations). </li></ul></ul><ul><li>Applications: </li></ul><ul><ul><li>GIS. </li></ul></ul><ul><ul><li>Image database exploration. </li></ul></ul><ul><ul><li>Robot navigation. </li></ul></ul><ul><ul><li>… (any application that uses spatial data). </li></ul></ul>
  8. 8. Primitives of Spatial Data Mining <ul><li>Spatial characteristic rules: </li></ul><ul><ul><li>A general description of spatial data. </li></ul></ul><ul><ul><li>E.g. price range of houses in various regions. </li></ul></ul><ul><li>Spatial discriminant rules: </li></ul><ul><ul><li>A general description of how classes of spatial data differ from one another. </li></ul></ul><ul><ul><li>E.g. a comparison of price ranges of houses in various regions. </li></ul></ul><ul><li>Spatial association rules: </li></ul><ul><ul><li>Implication of one feature or set of features by another set of features. </li></ul></ul><ul><ul><li>E.g. a house is near the beach -> the house is expensive. </li></ul></ul>
  9. 9. Primitives of Spatial Data Mining (Cont.) <ul><li>Thematic maps: </li></ul><ul><ul><li>Present the spatial distribution of a single or a few attributes. </li></ul></ul><ul><ul><li>E.g. Temperature thematic map. </li></ul></ul><ul><ul><li>Data stored as raster or vector images. </li></ul></ul><ul><li>Image database: </li></ul><ul><ul><li>A special kind of spatial database where the data consists almost entirely of images or pictures (e.g. satellite or medical images). </li></ul></ul><ul><ul><li>These images have coordinate properties. </li></ul></ul>
  10. 10. Data Mining Architecture <ul><li>An example: (by Matheus, 1993) </li></ul>
  11. 11. Mining By Statistical Methods <ul><li>Methods: </li></ul><ul><ul><li>Regression model. </li></ul></ul><ul><li>Disadvantages: </li></ul><ul><ul><li>Assume statistical independence among the spatially distributed data. </li></ul></ul><ul><ul><li>Need experts’ domain knowledge (in spatial data). </li></ul></ul><ul><ul><li>Cannot model non-linear rules or symbolic values very well. </li></ul></ul><ul><ul><li>Do not work well with incomplete or inconclusive data. </li></ul></ul>
  12. 12. Generalization-based Method <ul><li>Ideas: </li></ul><ul><ul><li>Learning from examples. </li></ul></ul><ul><ul><li>Combined with generalization. </li></ul></ul><ul><li>Concept hierarchy: </li></ul><ul><ul><li>Explicitly given by the domain experts. </li></ul></ul><ul><ul><li>Higher levels are more general terms. </li></ul></ul><ul><li>Attribute-oriented induction: </li></ul><ul><ul><li>Performed by climbing the generalization hierarchies and summarizing the general relationships between spatial and non-spatial data at higher concept levels. </li></ul></ul><ul><ul><li>Continues until a generalization threshold is reached. </li></ul></ul>
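The hierarchy-climbing step of attribute-oriented induction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PARENT` hierarchy, the crop values, and the reading of the threshold as "maximum number of distinct values" are made up for the example and do not come from the survey.

```python
from collections import Counter

# Hypothetical concept hierarchy: each value maps to its more general parent.
PARENT = {
    "corn": "grain", "wheat": "grain",
    "apple": "fruit", "pear": "fruit",
    "grain": "food crop", "fruit": "food crop",
}

def generalize(values, parent, threshold):
    """Climb the hierarchy until at most `threshold` distinct values remain."""
    current = list(values)
    while len(set(current)) > threshold:
        climbed = [parent.get(v, v) for v in current]
        if climbed == current:  # already at the top of the hierarchy
            break
        current = climbed
    return Counter(current)  # generalized values with their counts

crops = ["corn", "wheat", "apple", "pear", "corn"]
summary = generalize(crops, PARENT, threshold=2)
print(summary)
```

With a threshold of 2 the five crop values collapse to two general terms with counts; a threshold of 1 climbs one level further to the single root term.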
  13. 13. Spatial-data-dominant Generalization <ul><li>Ideas: </li></ul><ul><ul><li>First step: Spatial-oriented induction. </li></ul></ul><ul><ul><li>Merge spatial regions according to the spatial concept hierarchy. </li></ul></ul><ul><ul><li>Second step: Attribute-oriented induction. </li></ul></ul><ul><ul><li>The non-spatial data of each merged region are generalized up to the level set by the threshold. </li></ul></ul>
  14. 14. Non-spatial-data-dominant Generalization <ul><li>Ideas: </li></ul><ul><ul><li>First step: Attribute-oriented induction. </li></ul></ul><ul><ul><li>Non-spatial data are generalized up to the level set by the threshold. </li></ul></ul><ul><ul><li>Second step: Spatial-oriented induction. </li></ul></ul><ul><ul><li>Merge spatial regions that have the same non-spatial description. Small regions with a different non-spatial description that lie inside a large merged region are ignored. </li></ul></ul>
  15. 15. Generalization-based Method (Cont.)
  16. 16. Clustering-based Method <ul><li>Ideas: </li></ul><ul><ul><li>Clusters can be found without using any background knowledge. </li></ul></ul><ul><ul><li>Unsupervised learning. </li></ul></ul><ul><ul><li>Methods: </li></ul></ul><ul><ul><li>PAM – Repeatedly looks for a better set of k representatives by trying every possible swap of a representative with a non-representative. </li></ul></ul><ul><ul><li>CLARA – Same as PAM, but run on a sample of the data. </li></ul></ul><ul><ul><li>CLARANS – Same as PAM, but examines a randomly chosen subset of the possible swaps at each iteration. </li></ul></ul>
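A minimal Python sketch of the PAM-style swap search described on this slide. It is a simplified illustration, not the published algorithm: the initialisation (first k points) and the first-improvement swap order are assumptions of this sketch, and the sample points are made up.

```python
import itertools
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Keep swapping a medoid with a non-medoid while the total cost drops."""
    medoids = list(points[:k])  # naive initialisation (assumption of this sketch)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in itertools.product(range(k), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # accept any swap that lowers the cost
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, cost = pam(points, k=2)
print(sorted(medoids), cost)
```

In these terms, CLARA would call `pam` on a random sample of `points`, and CLARANS would replace the exhaustive `itertools.product` loop with a bounded number of randomly drawn swaps.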
  17. 17. SD-CLARANS <ul><li>Ideas: </li></ul><ul><ul><li>First step: Spatial-oriented induction. </li></ul></ul><ul><ul><li>Spatially relevant data are collected and clustered. </li></ul></ul><ul><ul><li>Second step: Attribute-oriented induction. </li></ul></ul><ul><ul><li>Find the non-spatial description of the objects in each cluster. </li></ul></ul>
  18. 18. NSD-CLARANS <ul><li>Ideas: </li></ul><ul><ul><li>First step: Attribute-oriented induction. </li></ul></ul><ul><ul><li>Produce a number of generalized tuples. </li></ul></ul><ul><ul><li>Second step: Spatial-oriented induction. </li></ul></ul><ul><ul><li>For each such generalized tuple, all spatial components are collected and clustered. </li></ul></ul>
  19. 19. Other Issues In Clustering <ul><li>Need a fast access method to the spatial data (e.g. R*-tree). </li></ul><ul><li>Focus on relevant data only. </li></ul><ul><li>Using CF tree (for example) to store clustered results: </li></ul><ul><ul><li>A tuple of data is incrementally inserted into the closest leaf node (a sub-cluster). </li></ul></ul><ul><ul><li>If the diameter of the sub-cluster exceeds a threshold after insertion, split that leaf node. </li></ul></ul><ul><ul><li>Each internal node contains a Clustering Feature (CF). </li></ul></ul><ul><ul><li>CF = (N, LS, SS), where N: number of points in the sub-cluster. </li></ul></ul><ul><ul><li>LS: linear sum of the N points. </li></ul></ul><ul><ul><li>SS: square sum of the N points. </li></ul></ul><ul><ul><li>Linear scalability; insensitivity to the input order; good quality of clustering. </li></ul></ul>
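The Clustering Feature triple is useful because it is additive: two sub-clusters merge by adding their (N, LS, SS) components, and the centroid and diameter follow from the triple alone. The Python sketch below illustrates this; the class and method names are hypothetical, and treating SS as a single scalar (the sum of squared norms) is an assumption of this sketch.

```python
import math
from dataclasses import dataclass

@dataclass
class CF:
    n: int      # N: number of points in the sub-cluster
    ls: tuple   # LS: per-dimension linear sum of the points
    ss: float   # SS: sum of squared norms of the points

    @classmethod
    def of(cls, point):
        """CF of a sub-cluster holding a single point."""
        return cls(1, tuple(point), sum(x * x for x in point))

    def merge(self, other):
        """CFs are additive, so merging needs no access to the raw points."""
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """Root of the mean squared pairwise distance, from (N, LS, SS) only."""
        if self.n < 2:
            return 0.0
        ls_norm2 = sum(x * x for x in self.ls)
        spread = (2 * self.n * self.ss - 2 * ls_norm2) / (self.n * (self.n - 1))
        return math.sqrt(max(spread, 0.0))

cf = CF.of((0, 0)).merge(CF.of((2, 0)))
print(cf, cf.centroid(), cf.diameter())
```

An insertion routine would compare `diameter()` of the merged CF against the threshold to decide whether the leaf must be split.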
  20. 20. Exploring Spatial Associations <ul><li>Example: </li></ul><ul><ul><li>Is_a(x, school) -> close_to(x, park) 80%. </li></ul></ul><ul><ul><li>Topological relations: intersect, overlap, disjoint… </li></ul></ul><ul><ul><li>Spatial orientation: left_of, west_of… </li></ul></ul><ul><ul><li>Distance information: close_to, far_away… </li></ul></ul><ul><li>Minimum Support: </li></ul><ul><ul><li>Ignore rules that are backed by too few instances. </li></ul></ul><ul><ul><li>E.g. ignore a relation that associates only 5% of the houses in an area with a single school. </li></ul></ul><ul><ul><li>Strong rule: a rule whose support and confidence both exceed their minimum thresholds. </li></ul></ul><ul><li>Minimum Confidence: </li></ul><ul><ul><li>Filter out rules with low confidence. </li></ul></ul><ul><ul><li>E.g. ignore a relation X->Y with only 5% confidence. </li></ul></ul>
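Support and confidence for a rule such as Is_a(x, school) -> close_to(x, park) can be computed directly from their definitions. The Python sketch below uses made-up objects and hypothetical threshold values; it assumes the spatial predicate close_to has already been evaluated and stored with each object.

```python
def support_and_confidence(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    with_a = [o for o in objects if antecedent(o)]
    with_both = [o for o in with_a if consequent(o)]
    support = len(with_both) / len(objects)          # share of all objects
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

# Made-up data: 5 schools (4 of them close to a park) and 5 houses.
objects = (
    [{"kind": "school", "close_to": {"park"}} for _ in range(4)]
    + [{"kind": "school", "close_to": set()}]
    + [{"kind": "house", "close_to": set()} for _ in range(5)]
)

support, confidence = support_and_confidence(
    objects,
    antecedent=lambda o: o["kind"] == "school",
    consequent=lambda o: "park" in o["close_to"],
)
MIN_SUPPORT, MIN_CONFIDENCE = 0.05, 0.5  # hypothetical thresholds
is_strong = support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE
print(support, confidence, is_strong)
```

Here four of the five schools are close to a park, so the rule holds with 80% confidence, matching the figure in the slide's example.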
  21. 21. Multi-level Spatial Association Rules <ul><li>Using a tree to explore: </li></ul><ul><ul><li>Collect task-relevant data. </li></ul></ul><ul><ul><li>Computation starts at a high level of spatial predicates, like close_to. </li></ul></ul><ul><ul><li>Utilize spatial indexing methods. </li></ul></ul><ul><ul><li>For those patterns that pass the filtering at the high levels, do further refinements at the lower levels, like adjacent_to, intersects, distance_less_than_x, etc. </li></ul></ul><ul><ul><li>Filter out those patterns that do not exceed the Minimum Support Threshold or the Minimum Confidence Threshold. </li></ul></ul><ul><ul><li>Derive the strong association rules! </li></ul></ul>
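The coarse-then-fine idea can be illustrated with a filter-and-refine sketch in Python: a cheap test on bounding rectangles stands in for the high-level close_to check, and only surviving pairs get the more precise distance test. Everything here is an assumption made for illustration: objects are plain vertex lists, the "exact" test is a vertex-to-vertex distance rather than a true geometry computation, and no spatial index is used.

```python
import math

def mbr(points):
    """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def mbr_distance(a, b):
    """Distance between two axis-aligned rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)

def vertex_distance(p, q):
    """Smallest vertex-to-vertex distance; a stand-in for an exact geometry test."""
    return min(math.dist(u, v) for u in p for v in q)

def close_pairs(left, right, limit):
    """Coarse MBR filter first, finer check only on the surviving pairs."""
    coarse = [(a, b) for a in left for b in right
              if mbr_distance(mbr(left[a]), mbr(right[b])) <= limit]
    fine = [(a, b) for a, b in coarse
            if vertex_distance(left[a], right[b]) <= limit]
    return coarse, fine

schools = {"S1": [(9, 1)], "S2": [(11, 11)], "S3": [(50, 50)]}
parks = {"P1": [(0, 0), (10, 10)]}
coarse, fine = close_pairs(schools, parks, limit=2)
print(coarse, fine)
```

Because the rectangle distance never exceeds the vertex distance, the coarse step cannot discard a pair that the finer step would accept; it only reduces how many pairs need the expensive check.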
  22. 22. Using Approximation and Aggregation <ul><li>Ideas: </li></ul><ul><ul><li>Instead of asking “where are the clusters in the spatial database?”, we want to know “what are the characteristics of the clusters in terms of the features that are close to them?” </li></ul></ul><ul><ul><li>E.g. “90% of the expensive houses in a cluster are close to a lake”. </li></ul></ul><ul><ul><li>Uses concepts from computational geometry. </li></ul></ul><ul><ul><li>First step: Eliminate unnecessary features. </li></ul></ul><ul><ul><li>Second step: Calculate the aggregate proximity of the points in the cluster to the convex boundary of each feature. </li></ul></ul><ul><ul><li>Experimental result: processing 50,000 features within 2 seconds. </li></ul></ul>
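A small Python sketch of the second step, assuming that "aggregate proximity" means the sum of each cluster point's distance to a feature's boundary and that every feature is already given as a convex polygon. The function names, cluster points, and features are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length2 = dx * dx + dy * dy
    if length2 == 0:
        t = 0.0
    else:  # projection of p onto the segment, clamped to its end points
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(p, polygon):
    """Distance from p to the nearest edge (also positive for interior points)."""
    edges = zip(polygon, polygon[1:] + polygon[:1])
    return min(point_segment_distance(p, a, b) for a, b in edges)

def aggregate_proximity(cluster, polygon):
    """Sum of boundary distances over all points of the cluster."""
    return sum(distance_to_boundary(p, polygon) for p in cluster)

cluster = [(0, 0), (1, 0), (0, 1)]
features = {
    "lake": [(2, 0), (4, 0), (4, 2), (2, 2)],
    "mall": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
ranking = sorted(features, key=lambda name: aggregate_proximity(cluster, features[name]))
print(ranking)
```

Sorting features by this value puts the features nearest to the cluster first, which is the kind of summary the slide's lake example describes.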
  23. 23. Mining In Image Databases <ul><li>Ideas: </li></ul><ul><ul><li>Mining useful information from image databases. </li></ul></ul><ul><ul><li>Example: Automatically identify volcanoes on the surface of Venus from images transmitted by the spacecraft. </li></ul></ul><ul><ul><li>Question: Is the above example related to spatial data mining research? </li></ul></ul>
  24. 24. Future Directions <ul><li>Data mining in spatial object-oriented databases. </li></ul><ul><li>Mining under uncertainty. </li></ul><ul><li>Alternative clustering techniques. </li></ul><ul><li>Mining spatial data deviation and evolution rules. </li></ul><ul><li>Using multiple thematic maps. </li></ul><ul><li>Interleaved generalization. </li></ul><ul><li>Generalization using temporal spatial data. </li></ul><ul><li>Spatial data mining query language. </li></ul><ul><li>Multidimensional rule visualization. </li></ul>
  25. 25. Conclusion <ul><li>What is spatial data mining? </li></ul><ul><li>(Non-)Spatial-data-dominant generalization </li></ul><ul><li>(Non-)Spatial-data-dominant clustering </li></ul><ul><li>Spatial association rules </li></ul><ul><li>Using approximation and aggregation </li></ul><ul><li>Mining in image databases </li></ul>