  • Spatial Data Mining: Three Case Studies. Shashi Shekhar, University of Minnesota. Presented to UCGIS Summer Assembly 2001. For additional details: www.cs.umn.edu/~shekhar/problems.html
  • Background
    • NSF workshop on GIS and DM (3/99)
    • Spatial data [1, 8] - traffic, bird habitats, global climate, logistics, ...
    • For spatial patterns - outliers, location prediction, associations, sequential associations, trends, …
  • Framework
    • Problem statement: capture special needs
    • Data exploration: maps, new methods
    • Try reusing classical methods
      • from data mining, spatial statistics
    • If reuse is not possible, invent new methods
    • Validation, Performance tuning
  • Case 1: Spatial Outliers
    • Problem: stations different from neighbors [SIGKDD 2001]
    • Data - space-time plot, distribution of f(x), S(x)
    • Distribution of base attribute:
      • spatially smooth
      • frequency distribution over value domain: normal
    • Classical test - Pr.[item in population] is low
      • Question: what is the distribution of diff.[f(x), neighborhood agg{f(x)}]?
      • Insight: this statistic is normally distributed!
      • Test: (z-score of the statistic) > 2
      • Performance - spatial join, clustering methods
  • Spatial outlier detection [4]
    • Spatial outlier
    • A data point that is extreme relative to its neighbors
    • Given
    • A spatial graph G={V,E}
    • A neighbor relationship (K neighbors)
    • An attribute function $f: V \to \mathbb{R}$
    • An aggregation function $f_{aggr}: \mathbb{R}^k \to \mathbb{R}$
    • Confidence level threshold $\theta$
    • Find
    • $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$
    • Objective
    • Correctness: the attribute value of v_i is extreme compared with its neighbors
    • Computational efficiency
    • Constraints
    • Attribute value is normally distributed
    • Computation cost is dominated by I/O operations
  • Spatial outlier detection
    • Spatial Outlier Detection Test
    • 1. Choice of Spatial Statistic
    • $S(x) = f(x) - E_{y \in N(x)}[f(y)]$
    • Theorem: S(x) is normally distributed if f(x) is normally distributed
    • 2. Test for Outlier Detection
    • $|(S(x) - \mu_s) / \sigma_s| > \theta$
    • Hypothesis
    • I/O cost determined by clustering efficiency
    [Figure: plots of f(x) and S(x), showing a spatial outlier and its neighbors]
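The two-step test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the neighborhood aggregate is a plain average, that the mean and standard deviation of S(x) are estimated from the data itself, and that a threshold of 2 is used (as on the Case 1 slide). The function name `spatial_outliers` and the sample chain of stations are hypothetical.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=2.0):
    """Return the nodes whose |z-score of S(x)| exceeds theta.

    f: dict mapping node -> attribute value f(x)
    neighbors: dict mapping node -> list of neighbor nodes N(x)
    """
    # S(x) = f(x) - average of f over the neighbors of x
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    mu = mean(s.values())
    sigma = pstdev(s.values())
    if sigma == 0:
        return []  # every node matches its neighbors equally well
    return sorted(x for x in s if abs((s[x] - mu) / sigma) > theta)

# Hypothetical data: ten stations on a line, station 5 reads much higher.
stations = list(range(10))
f = {x: 10.0 for x in stations}
f[5] = 50.0
neighbors = {x: [y for y in (x - 1, x + 1) if y in f] for x in stations}
print(spatial_outliers(f, neighbors))  # prints [5]
```

Stations 4 and 6 also get a nonzero S(x) because their neighborhood average is pulled up by station 5, but their z-scores stay below the threshold, so only station 5 is reported.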
  • Spatial outlier detection
    • Results
    • 1. CCAM achieves higher clustering efficiency (CE)
    • 2. CCAM has lower I/O cost
    • 3. Higher CE leads to lower I/O cost
    • 4. Page size improves CE for all methods
    [Figure: I/O cost and CE value for Z-order, Cell-Tree, and CCAM]
  • Case 2: Location Prediction
    • Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000
    • Problem: predict nesting site in marshes
      • given vegetation, water depth, distance to edge, etc.
    • Data - maps of nests and attributes
      • spatially clustered nests, spatially smooth attributes
    • Classical methods: logistic regression, decision trees, Bayesian classifier
      • but the independence assumption is violated! Misses auto-correlation!
      • Spatial auto-regression (SAR), Markov random field Bayesian classifier
      • Open issue: spatial accuracy vs. classification accuracy
      • Open issue: performance - SAR learning is slow!
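To make the difference between classical regression and SAR concrete, here is a small sketch. It assumes the commonly used SAR form y = rho * W y + X beta + noise, with W a row-normalized neighbor matrix; the slide does not spell out the equation, so treat that form as an assumption. The noise term is omitted, the fixed-point iteration is chosen only for brevity, and the names `row_normalize` and `sar_response` are hypothetical.

```python
def row_normalize(adj):
    """Turn a 0/1 adjacency matrix into a weight matrix W whose rows sum to 1."""
    w = []
    for row in adj:
        total = sum(row)
        w.append([v / total if total else 0.0 for v in row])
    return w

def sar_response(w, xb, rho, iters=200):
    """Solve y = rho * W y + xb by fixed-point iteration (noise omitted).

    xb holds the classical regression term X * beta for each location.
    The iteration converges for |rho| < 1 when W is row-normalized.
    """
    y = list(xb)
    for _ in range(iters):
        y = [rho * sum(wij * yj for wij, yj in zip(row, y)) + b
             for row, b in zip(w, xb)]
    return y

# Hypothetical example: three locations on a line.
w = row_normalize([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
xb = [1.0, 0.0, 1.0]
print(sar_response(w, xb, rho=0.0))  # rho = 0: plain regression, y equals xb
print(sar_response(w, xb, rho=0.5))  # rho > 0: each y also depends on its neighbors
```

With rho = 0 the model reduces to the classical, independence-assuming regression. A nonzero rho couples each location to its neighbors, which is how the model accounts for auto-correlation.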
  • Location Prediction [6, 7, 8]
    • Given:
    • 1. Spatial Framework
    • 2. Explanatory functions:
    • 3. A dependent function
    • 4. A family of function mappings:
    • Find: A function
    • Objective: maximize classification_accuracy
    • Constraints:
    • Spatial Autocorrelation exists
    [Figure: maps of nest locations, distance to open water, vegetation durability, and water depth]
  • Evaluation: Changing Model
    • Linear Regression
    • Spatial Regression
    • Spatial model is better
  • Evaluation: Changing measure New measure:
  • Case 3: Spatial Association Rules
    • Citation: Symp. on Spatial Databases 2001
    • Problem: Given a set of boolean spatial features
      • find subsets of co-located features, e.g. (fire, drought, vegetation)
      • Data - continuous space, partition not natural, no reference feature
    • Classical data mining approach: association rules
      • But, Look Ma! No Transactions!!! No support measure!
    • Approach: Work with continuous data without transactionizing it!
      • confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
      • support: cardinality of spatial join of instances of fire, drought, dry veg.
      • participation: min. fraction of instances of a feature in the join result
      • new algorithm using spatial joins and apriori_gen filters
  • Co-location Patterns [2, 3]: Can you find co-location patterns from the following sample dataset?
  • Co-location Patterns
    • Spatial Co-location
    • A set of features frequently co-located
    • Given
    • A set T of K boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
    • A set P of N locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework S; each $p_i \in P$ is of some spatial feature type in T
    • A neighbor relation R over locations in S
    • Find
    • $T_c$ = subsets of T that are frequently co-located
    • Objective
    • Correctness
    • Completeness
    • Efficiency
    • Constraints
    • R is symmetric and reflexive
    • Monotonic prevalence measure
    [Figure: three panels: Reference Feature Centric, Window Centric, Event Centric]
  • Co-location Patterns
    • Participation index
    • 1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
    • 2. Participation index = $\min_i \{pr(f_i, c)\}$
    • Algorithm
    • Hybrid Co-location Miner
    Comparison with association rules (association rules vs. co-location rules)
    • underlying space: discrete sets vs. continuous space
    • item-types: item-types vs. events / Boolean spatial features
    • collections: transactions vs. neighborhoods
    • prevalence measure: support vs. participation index
    • conditional probability measure: Pr.[A in T | B in T] vs. Pr.[A in N(L) | B at L]
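The participation ratio and participation index defined on this slide can be computed with a brute-force spatial join. The sketch below is an illustration under stated assumptions, not the Hybrid Co-location Miner: it takes "nearby" to mean Euclidean distance within a fixed radius, counts an instance as participating only when it appears in a group whose members are all pairwise neighbors, and enumerates every combination rather than using apriori_gen-style filtering. The function name and the sample points are hypothetical.

```python
from itertools import product
from math import dist

def participation_index(instances, colocation, radius):
    """Return (participation ratios, participation index) for a co-location.

    instances: dict mapping feature -> non-empty list of (x, y) points
    colocation: iterable of feature names, e.g. ("fire", "drought")
    radius: two points are neighbors when their distance is <= radius
    """
    feats = list(colocation)
    participating = {f: set() for f in feats}
    # Brute-force spatial join: one instance per feature in each combination.
    for combo in product(*(range(len(instances[f])) for f in feats)):
        pts = [instances[f][i] for f, i in zip(feats, combo)]
        # Keep the combination only if all its points are pairwise neighbors.
        if all(dist(p, q) <= radius
               for k, p in enumerate(pts) for q in pts[k + 1:]):
            for f, i in zip(feats, combo):
                participating[f].add(i)
    ratios = {f: len(participating[f]) / len(instances[f]) for f in feats}
    return ratios, min(ratios.values())

# Hypothetical sample data.
instances = {
    "fire": [(0, 0), (10, 10)],
    "drought": [(0, 1), (5, 5), (20, 20)],
}
ratios, index = participation_index(instances, ("fire", "drought"), radius=2)
print(ratios, index)
```

Here one of two fire instances and one of three drought instances take part in the join, so the ratios are 1/2 and 1/3 and the participation index is the smaller of the two.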
  • Conclusions & Future Directions
    • Spatial domains may not satisfy assumptions of classical methods
      • data: auto-correlation, continuous geographic space
      • patterns: global vs. local, e.g. spatial outliers vs. outliers
      • data exploration: maps and albums
    • Open Issues
      • patterns: hot-spots, blobology (shape), spatial trends, …
      • metrics: spatial accuracy(predicted locations), spatial contiguity(clusters)
      • spatio-temporal dataset
      • scale and resolution sensitivity of patterns
      • geo-statistical confidence measure for mined patterns
  • References
    • [1] S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, and C.T. Lu, “Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
    • [2] S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: A Summary of Results”, in Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
    • [3] S. Shekhar, Y. Huang, and H. Xiong, “Performance Evaluation of Co-location Miner”, IEEE International Conference on Data Mining (ICDM’01), Nov. 2001. (Submitted)
    • [4] S. Shekhar, C.T. Lu, and P. Zhang, “Detecting Graph-based Spatial Outliers: Algorithms and Applications”, Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
    • [5] S. Shekhar and S. Chawla, “Spatial Database: Concepts, Implementation and Trends” (book, to be published in 2001).
    • [6] S. Chawla, S. Shekhar, W. Wu, and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
    • [7] S. Chawla, S. Shekhar, W. Wu, and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001.
    • [8] S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”, IEEE Transactions on Multimedia, 2001. (Submitted)
    Some papers are available on the Web site: http://www.cs.umn.edu/research/shashi-group/