  1. 1. Spatial Data Mining: Three Case Studies. Shashi Shekhar, University of Minnesota. Presented to UCGIS Summer Assembly 2001. For additional details: www.cs.umn.edu/~shekhar/problems.html
  2. 2. Background <ul><li>NSF workshop on GIS and DM (3/99) </li></ul><ul><li>Spatial data [1, 8] - traffic, bird habitats, global climate, logistics, ... </li></ul><ul><li>For spatial patterns - outliers, location prediction, associations, sequential associations, trends, … </li></ul>
  3. 3. Framework <ul><li>Problem statement: capture special needs </li></ul><ul><li>Data exploration: maps, new methods </li></ul><ul><li>Try reusing classical methods </li></ul><ul><ul><li>from data mining, spatial statistics </li></ul></ul><ul><li>If reuse is not possible, invent new methods </li></ul><ul><li>Validation, Performance tuning </li></ul>
  4. 4. Case 1: Spatial Outliers <ul><li>Problem: find stations that differ from their neighbors [SIGKDD 2001] </li></ul><ul><li>Data - space-time plot, distributions of f(x) and S(x) </li></ul><ul><li>Distribution of base attribute: </li></ul><ul><ul><li>spatially smooth </li></ul></ul><ul><ul><li>frequency distribution over value domain: normal </li></ul></ul><ul><li>Classical test - Pr.[item in population] is low </li></ul><ul><ul><li>Q? What is the distribution of diff.[f(x), neighborhood agg{f(x)}] </li></ul></ul><ul><ul><li>Insight: this statistic is normally distributed! </li></ul></ul><ul><ul><li>Test: |z-score of the statistic| > 2 </li></ul></ul><ul><ul><li>Performance - spatial join, clustering methods </li></ul></ul>
  5. 5. Spatial outlier detection [4] <ul><li>Spatial outlier: a data point that is extreme relative to its neighbors </li></ul><ul><li>Given </li></ul><ul><ul><li>A spatial graph G = {V, E} </li></ul></ul><ul><ul><li>A neighbor relationship (K neighbors) </li></ul></ul><ul><ul><li>An attribute function f: V -> R </li></ul></ul><ul><ul><li>An aggregation function f_aggr: R^k -> R </li></ul></ul><ul><ul><li>A confidence level threshold θ </li></ul></ul><ul><li>Find </li></ul><ul><ul><li>O = {v_i | v_i ∈ V, v_i is a spatial outlier} </li></ul></ul><ul><li>Objective </li></ul><ul><ul><li>Correctness: the attribute value of v_i is extreme compared with those of its neighbors </li></ul></ul><ul><ul><li>Computational efficiency </li></ul></ul><ul><li>Constraints </li></ul><ul><ul><li>Attribute value is normally distributed </li></ul></ul><ul><ul><li>Computation cost is dominated by I/O operations </li></ul></ul>
  6. 6. Spatial outlier detection <ul><li>Spatial Outlier Detection Test </li></ul><ul><ul><li>1. Choice of spatial statistic: S(x) = f(x) − E_{y ∈ N(x)}[f(y)] </li></ul></ul><ul><ul><li>Theorem: S(x) is normally distributed if f(x) is normally distributed </li></ul></ul><ul><ul><li>2. Test for outlier detection: |(S(x) − μ_s) / σ_s| > θ </li></ul></ul><ul><li>Hypothesis </li></ul><ul><ul><li>I/O cost is determined by clustering efficiency </li></ul></ul> [Figure: plots of f(x) and S(x), showing a spatial outlier and its neighbors]
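The test on this slide can be illustrated with a minimal Python sketch. It assumes the aggregation function is the plain average over the neighbors and that μ_s and σ_s are the mean and population standard deviation of S(x) over all nodes; the function name `spatial_outliers`, the dictionary layout, and the sample stations are illustrative choices, not taken from the slides or the cited paper.

```python
from statistics import mean, pstdev


def spatial_outliers(values, neighbors, theta=2.0):
    """Return the nodes x with |(S(x) - mu_s) / sigma_s| > theta.

    values:    dict mapping node -> attribute value f(x)
    neighbors: dict mapping node -> list of neighboring nodes N(x)
    theta:     threshold on the absolute z-score (assumed default of 2)
    """
    # S(x) = f(x) minus the average of f over the neighbors of x.
    s = {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())
    if sigma_s == 0:
        return []  # every node matches its neighborhood equally well
    return [x for x in values if abs((s[x] - mu_s) / sigma_s) > theta]


# Hypothetical data: ten stations on a line, station 5 reads far higher.
values = {i: 10 for i in range(10)}
values[5] = 50
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

outliers = spatial_outliers(values, neighbors)
print(outliers)
```

Stations 4 and 6 are pulled away from their neighborhood average by station 5, but only station 5 itself exceeds the threshold. This brute-force version ignores the I/O and clustering concerns that the slides treat as the main performance issue.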
  7. 7. Spatial outlier detection <ul><li>Results </li></ul><ul><ul><li>1. CCAM achieves higher clustering efficiency (CE) </li></ul></ul><ul><ul><li>2. CCAM has lower I/O cost </li></ul></ul><ul><ul><li>3. Higher CE leads to lower I/O cost </li></ul></ul><ul><ul><li>4. Page size improves CE for all methods </li></ul></ul> [Figure: CE value and I/O cost for Z-order, CCAM, and Cell-Tree]
  8. 8. Case 2: Location Prediction <ul><li>Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000 </li></ul><ul><li>Problem: predict nesting sites in marshes </li></ul><ul><ul><li>given vegetation, water depth, distance to edge, etc. </li></ul></ul><ul><li>Data - maps of nests and attributes </li></ul><ul><ul><li>spatially clustered nests, spatially smooth attributes </li></ul></ul><ul><li>Classical methods: logistic regression, decision trees, Bayesian classifier </li></ul><ul><ul><li>but the independence assumption is violated! They miss auto-correlation! </li></ul></ul><ul><ul><li>Spatial auto-regression (SAR), Markov random field Bayesian classifier </li></ul></ul><ul><ul><li>Open issue: spatial accuracy vs. classification accuracy </li></ul></ul><ul><ul><li>Open issue: performance - SAR learning is slow! </li></ul></ul>
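For reference, the contrast between a classical regression model and spatial auto-regression is commonly written as below. The notation (y, X, β, ρ, W, ε) is the standard textbook form and is an assumption here; the slides do not show the formulas.

```latex
\begin{align}
\text{Classical linear regression:}\quad & y = X\beta + \varepsilon \\
\text{Spatial auto-regression (SAR):}\quad & y = \rho W y + X\beta + \varepsilon
\end{align}
```

Here W is a neighborhood (contiguity) matrix over the locations and ρ measures the strength of spatial dependence. Setting ρ = 0 recovers the classical model, which is one way to see that the classical model cannot represent auto-correlation among neighboring locations.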
  9. 9. Location Prediction [6, 7, 8] <ul><li>Given: </li></ul><ul><ul><li>1. Spatial Framework </li></ul></ul><ul><ul><li>2. Explanatory functions: </li></ul></ul><ul><ul><li>3. A dependent function </li></ul></ul><ul><ul><li>4. A family of function mappings: </li></ul></ul><ul><li>Find: A function </li></ul><ul><li>Objective: maximize classification_accuracy </li></ul><ul><li>Constraints: </li></ul><ul><ul><li>Spatial autocorrelation exists </li></ul></ul> [Figure panels: Nest locations; Distance to open water; Vegetation durability; Water depth]
  10. 10. Evaluation: Changing Model <ul><li>Linear Regression </li></ul><ul><li>Spatial Regression </li></ul><ul><li>Spatial model is better </li></ul>
  11. 11. Evaluation: Changing measure New measure:
  12. 12. Case 3: Spatial Association Rules <ul><li>Citation: Symp. on Spatial Databases 2001 </li></ul><ul><li>Problem: Given a set of boolean spatial features </li></ul><ul><ul><li>find subsets of co-located features, e.g. (fire, drought, vegetation) </li></ul></ul><ul><ul><li>Data - continuous space, partition not natural, no reference feature </li></ul></ul><ul><li>Classical data mining approach: association rules </li></ul><ul><ul><li>But, Look Ma! No Transactions!!! No support measure! </li></ul></ul><ul><li>Approach: Work with continuous data without transactionizing it! </li></ul><ul><ul><li>confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)] </li></ul></ul><ul><ul><li>support: cardinality of spatial join of instances of fire, drought, dry veg. </li></ul></ul><ul><ul><li>participation: min. fraction of instances of a feature in join result </li></ul></ul><ul><ul><li>new algorithm using spatial joins and apriori_gen filters </li></ul></ul>
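The apriori_gen filter mentioned on this slide can be sketched in its classical form from association-rule mining: join prevalent sets of size k-1 into candidates of size k, then prune any candidate that has a non-prevalent subset. This is a generic illustration, not the authors' miner; the function signature and the feature names in the example are assumptions.

```python
from itertools import combinations


def apriori_gen(prevalent, k):
    """Generate candidate feature sets of size k.

    prevalent: iterable of feature sets of size k-1 already found prevalent
    Returns a set of frozensets whose (k-1)-subsets are all in `prevalent`.
    """
    prev = {frozenset(s) for s in prevalent}
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue  # join step: keep only unions that grow by one feature
            # Prune step: a candidate survives only if every (k-1)-subset
            # is prevalent, which relies on a monotonic prevalence measure.
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical prevalent pairs of boolean spatial features.
pairs = [
    {"fire", "drought"},
    {"fire", "vegetation"},
    {"drought", "vegetation"},
    {"fire", "flood"},
]
triples = apriori_gen(pairs, 3)
print(triples)
```

Only {fire, drought, vegetation} survives, because every candidate containing flood has at least one pair that is not prevalent. In a co-location setting the surviving candidates would still need to be checked against the data with a spatial join.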
  13. 13. Co-location Patterns [2, 3] Can you find co-location patterns from the following sample dataset? Answers: [two answer patterns shown as images, not recovered]
  14. 14. Co-location Patterns Can you find co-location patterns from the following sample dataset?
  15. 15. Co-location Patterns <ul><li>Spatial co-location: a set of features that are frequently co-located </li></ul><ul><li>Given </li></ul><ul><ul><li>A set T of K boolean spatial feature types T = {f_1, f_2, …, f_K} </li></ul></ul><ul><ul><li>A set P of N locations P = {p_1, …, p_N} in a spatial framework S, where each p_i ∈ P is an instance of some spatial feature in T </li></ul></ul><ul><ul><li>A neighbor relation R over locations in S </li></ul></ul><ul><li>Find </li></ul><ul><ul><li>T_c = subsets of T that are frequently co-located </li></ul></ul><ul><li>Objective </li></ul><ul><ul><li>Correctness </li></ul></ul><ul><ul><li>Completeness </li></ul></ul><ul><ul><li>Efficiency </li></ul></ul><ul><li>Constraints </li></ul><ul><ul><li>R is symmetric and reflexive </li></ul></ul><ul><ul><li>Monotonic prevalence measure </li></ul></ul> [Figure panels: Reference Feature Centric; Window Centric; Event Centric]
  16. 16. Co-location Patterns <ul><li>Participation index </li></ul><ul><ul><li>1. Participation ratio pr(f_i, c) of feature f_i in co-location c = {f_1, f_2, …, f_k}: the fraction of instances of f_i with features {f_1, …, f_{i-1}, f_{i+1}, …, f_k} nearby </li></ul></ul><ul><ul><li>2. Participation index = min{pr(f_i, c)} </li></ul></ul><ul><li>Algorithm </li></ul><ul><ul><li>Hybrid Co-location Miner </li></ul></ul>
  Comparison with association rules
  <table>
  <tr><th></th><th>Association rules</th><th>Co-location rules</th></tr>
  <tr><td>underlying space</td><td>discrete sets</td><td>continuous space</td></tr>
  <tr><td>item-types</td><td>item-types</td><td>events / Boolean spatial features</td></tr>
  <tr><td>collections</td><td>transactions</td><td>neighborhoods</td></tr>
  <tr><td>prevalence measure</td><td>support</td><td>participation index</td></tr>
  <tr><td>conditional probability measure</td><td>Pr.[ A in T | B in T ]</td><td>Pr.[ A in N(L) | B at L ]</td></tr>
  </table>
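The participation ratio and participation index can be illustrated with a small brute-force Python sketch. It assumes that two locations are neighbors when their Euclidean distance is at most `radius`, and that an instance of a co-location requires every pair of its points to be neighbors; both are assumptions made for this example, as are the function name and the sample coordinates. It is not the Hybrid Co-location Miner.

```python
from itertools import product
from math import dist


def participation_index(instances, colocation, radius):
    """Return (participation index, participation ratio per feature).

    instances:  dict mapping feature -> non-empty list of (x, y) points
    colocation: iterable of feature names, e.g. ("fire", "drought")
    radius:     assumed neighbor distance for the relation R
    """
    feats = list(colocation)
    participating = {f: set() for f in feats}
    # Enumerate every combination with one instance per feature.
    for combo in product(*(range(len(instances[f])) for f in feats)):
        pts = [instances[f][i] for f, i in zip(feats, combo)]
        # Keep the combination only if all pairs of points are neighbors.
        if all(dist(p, q) <= radius
               for a, p in enumerate(pts) for q in pts[a + 1:]):
            for f, i in zip(feats, combo):
                participating[f].add(i)
    # Fraction of each feature's instances that appear in some instance of c.
    ratios = {f: len(participating[f]) / len(instances[f]) for f in feats}
    return min(ratios.values()), ratios


# Hypothetical sample: two fires, three droughts, one drought far from any fire.
sample = {
    "fire": [(0, 0), (10, 10)],
    "drought": [(1, 0), (11, 10), (50, 50)],
}
index, ratios = participation_index(sample, ("fire", "drought"), radius=2)
print(index, ratios)
```

Both fires have a drought nearby, but only two of the three droughts have a fire nearby, so the index is the smaller of the two ratios. Enumerating all combinations is exponential in the number of features, which is why the slides point to spatial joins and apriori_gen filters instead.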
  17. 17. Conclusions & Future Directions <ul><li>Spatial domains may not satisfy assumptions of classical methods </li></ul><ul><ul><li>data: auto-correlation, continuous geographic space </li></ul></ul><ul><ul><li>patterns: global vs. local, e.g. spatial outliers vs. outliers </li></ul></ul><ul><ul><li>data exploration: maps and albums </li></ul></ul><ul><li>Open Issues </li></ul><ul><ul><li>patterns: hot-spots, blobology (shape), spatial trends, … </li></ul></ul><ul><ul><li>metrics: spatial accuracy (predicted locations), spatial contiguity (clusters) </li></ul></ul><ul><ul><li>spatio-temporal datasets </li></ul></ul><ul><ul><li>sensitivity of patterns to scale and resolution </li></ul></ul><ul><ul><li>geo-statistical confidence measures for mined patterns </li></ul></ul>
  18. 18. References <ul><li>[1] S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, “Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999. </li></ul><ul><li>[2] S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001. </li></ul><ul><li>[3] S. Shekhar, Y. Huang, and H. Xiong, “Performance Evaluation of Co-location Miner”, IEEE International Conference on Data Mining (ICDM’01), Nov. 2001. (Submitted) </li></ul><ul><li>[4] S. Shekhar, C.T. Lu, P. Zhang, “Detecting Graph-based Spatial Outliers: Algorithms and Applications”, Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001. </li></ul><ul><li>[5] S. Shekhar, S. Chawla, “Spatial Database: Concepts, Implementation and Trends” (book). (To be published in 2001) </li></ul><ul><li>[6] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000. </li></ul><ul><li>[7] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001. </li></ul><ul><li>[8] S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”, IEEE Transactions on Multimedia, 2001. (Submitted) </li></ul>Some papers are available on the Web site: http://www.cs.umn.edu/research/shashi-group/