    Presentation Transcript

    • Spatial Data Mining: Three Case Studies
      • Shashi Shekhar, University of Minnesota
      • Presented to UCGIS Summer Assembly 2001
      • For additional details: www.cs.umn.edu/~shekhar/problems.html
    • Background
      • NSF workshop on GIS and DM (3/99)
      • Spatial data [1, 8]: traffic, bird habitats, global climate, logistics, ...
      • Spatial patterns sought: outliers, location prediction, associations, sequential associations, trends, …
    • Framework
      • Problem statement: capture special needs
      • Data exploration: maps, new methods
      • Try reusing classical methods
        • from data mining, spatial statistics
      • If reuse is not possible, invent new methods
      • Validation, Performance tuning
    • Case 1: Spatial Outliers
      • Problem: stations different from neighbors [SIGKDD 2001]
      • Data: space-time plot, distributions of f(x) and S(x)
      • Distribution of base attribute:
        • spatially smooth
        • frequency distribution over value domain: normal
      • Classical test: flag an item if Pr.[item in population] is low
        • Question: what is the distribution of diff.[f(x), neighborhood agg{f(x)}]?
        • Insight: this statistic is normally distributed!
        • Test: |z-score of the statistic| > 2
        • Performance: spatial join, clustering methods
    • Spatial outlier detection [4]
      • Spatial outlier
        • A data point that is extreme relative to its neighbors
      • Given
        • A spatial graph $G = \{V, E\}$
        • A neighbor relationship (K neighbors)
        • An attribute function $f: V \to \mathbb{R}$
        • An aggregation function $f_{aggr}: \mathbb{R}^k \to \mathbb{R}$
        • A confidence level threshold $\theta$
      • Find
        • $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$
      • Objective
        • Correctness: the attribute value of $v_i$ is extreme compared with its neighbors
        • Computational efficiency
      • Constraints
        • Attribute value is normally distributed
        • Computation cost is dominated by I/O operations
    • Spatial outlier detection
      • Spatial Outlier Detection Test
        • 1. Choice of spatial statistic
          • $S(x) = f(x) - E_{y \in N(x)}[f(y)]$
          • Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
        • 2. Test for outlier detection
          • $\left| \dfrac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
      • Hypothesis
        • I/O cost is determined by clustering efficiency
      [Figure: plots of f(x) and S(x); a spatial outlier and its neighbors]
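The test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `spatial_outliers`, the dictionary-based graph, the use of the population standard deviation, and the sample readings are all assumptions made for the example. The default threshold of 2 is the value given on the Case 1 slide.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=2.0):
    """Return nodes whose neighborhood difference S(x) has |z-score| > theta.

    f         : dict mapping node -> attribute value f(x)
    neighbors : dict mapping node -> list of neighboring nodes N(x)
    theta     : z-score threshold (assumed default of 2)
    """
    # S(x) = f(x) minus the average of f over the neighbors of x
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    mu = mean(s.values())
    sigma = pstdev(s.values())  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all differences are identical, so nothing stands out
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Hypothetical data: ten stations along a road, each adjacent to the next
readings = [10.0, 11.0, 10.0, 12.0, 11.0, 50.0, 11.0, 10.0, 12.0, 11.0]
f = dict(enumerate(readings))
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(readings)] for i in f}
flagged = spatial_outliers(f, chain)
print(flagged)
```

Only station 5 is flagged here. Its two neighbors also have large values of S(x) because their neighborhood average is pulled up by station 5, but their z-scores stay below the threshold.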
    • Spatial outlier detection
      • Results
        • 1. CCAM achieves higher clustering efficiency (CE)
        • 2. CCAM has lower I/O cost
        • 3. Higher CE leads to lower I/O cost
        • 4. Page size improves CE for all methods
      [Figure: I/O cost and CE value for Z-order, CCAM, and Cell-Tree]
    • Case 2: Location Prediction
      • Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000
      • Problem: predict nesting site in marshes
        • given vegetation, water depth, distance to edge, etc.
      • Data - maps of nests and attributes
        • spatially clustered nests, spatially smooth attributes
      • Classical methods: logistic regression, decision trees, Bayesian classifier
        • But the independence assumption is violated! They miss spatial auto-correlation!
        • Alternatives: spatial auto-regression (SAR), Markov random field Bayesian classifier
        • Open issue: spatial accuracy vs. classification accuracy
        • Open issue: performance; SAR learning is slow!
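For orientation, the spatial auto-regression model mentioned above is commonly written in the form below. The symbols are standard notation rather than anything defined on the slides: $y$ is the dependent variable, $X$ the explanatory attributes, $W$ a neighborhood (contiguity) matrix, $\rho$ the strength of spatial dependence, and $\varepsilon$ the error term.

```latex
y = \rho W y + X \beta + \varepsilon
```

Setting $\rho = 0$ recovers classical linear regression, $y = X\beta + \varepsilon$, which treats every location as independent of its neighbors.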
    • Location Prediction [6, 7, 8]
      • Given:
      • 1. Spatial Framework
      • 2. Explanatory functions:
      • 3. A dependent function
      • 4. A family of function mappings:
      • Find: A function
      • Objective: maximize classification_accuracy
      • Constraints:
      • Spatial autocorrelation exists
      [Figure: maps of nest locations, distance to open water, vegetation durability, and water depth]
    • Evaluation: Changing Model
      • Linear Regression
      • Spatial Regression
      • Spatial model is better
    • Evaluation: Changing measure New measure:
    • Case 3: Spatial Association Rules
      • Citation: Symp. on Spatial Databases 2001
      • Problem: Given a set of boolean spatial features
        • find subsets of co-located features, e.g. (fire, drought, vegetation)
        • Data - continuous space, partition not natural, no reference feature
      • Classical data mining approach: association rules
        • But, Look Ma! No Transactions!!! No support measure!
      • Approach: Work with continuous data without transactionizing it!
        • confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
        • support: cardinality of spatial join of instances of fire, drought, dry veg.
        • participation: min. fraction of instances of a feature in the join result
        • new algorithm using spatial joins and apriori_gen filters
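The `apriori_gen` filter named in the last bullet is the candidate-generation step of the classical Apriori algorithm: join prevalent size-k sets that share their first k-1 items, then prune any candidate with a non-prevalent size-k subset. A minimal sketch applied to feature sets follows. The feature names are hypothetical, and how the slides' algorithm interleaves this step with spatial joins is not specified here.

```python
from itertools import combinations

def apriori_gen(prevalent):
    """Generate size-(k+1) candidate feature sets from prevalent size-k sets.

    prevalent : iterable of feature-name tuples, all of the same size k
    """
    prev = {tuple(sorted(p)) for p in prevalent}
    candidates = set()
    for a in prev:
        for b in prev:
            # Join step: same first k-1 features, and a's last sorts before b's
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every size-k subset must itself be prevalent
                if all(sub in prev for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return sorted(candidates)

# Hypothetical prevalent pairs of boolean spatial features
pairs = [("drought", "fire"), ("drought", "vegetation"),
         ("fire", "vegetation"), ("fire", "flood")]
triples = apriori_gen(pairs)
print(triples)
```

The candidate (fire, flood, vegetation) is pruned because (flood, vegetation) is not prevalent. Pruning of this kind is only safe when the prevalence measure is monotonic, so that a set cannot be prevalent unless all of its subsets are.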
    • Co-location Patterns [2, 3]
      • Can you find co-location patterns from the following sample dataset?
      [Figure: sample dataset, followed by the answers]
    • Co-location Patterns
      • Spatial co-location
        • A set of features frequently co-located
      • Given
        • A set $T$ of $K$ boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
        • A set $P$ of $N$ locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework $S$; each $p_i \in P$ is of some spatial feature in $T$
        • A neighbor relation $R$ over locations in $S$
      • Find
        • $T_c$ = subsets of $T$ frequently co-located
      • Objective
        • Correctness
        • Completeness
        • Efficiency
      • Constraints
        • $R$ is symmetric and reflexive
        • Monotonic prevalence measure
      [Figure panels: Reference Feature Centric, Window Centric, Event Centric]
    • Co-location Patterns
      • Participation index
        • 1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
        • 2. Participation index = $\min_i \{pr(f_i, c)\}$
      • Algorithm
        • Hybrid Co-location Miner
      • Comparison with association rules (association rules vs. co-location rules)
        • Underlying space: discrete sets vs. continuous space
        • Item-types: item-types vs. events / Boolean spatial features
        • Collections: transactions vs. neighborhoods
        • Prevalence measure: support vs. participation index
        • Conditional probability measure: Pr.[A in T | B in T] vs. Pr.[A in N(L) | B at L]
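The participation ratio and participation index defined on this slide can be computed by brute force on a small dataset. The sketch below makes several assumptions for illustration: "nearby" means Euclidean distance at most `d`, a pattern instance is one point per feature with every pair within `d`, and the function name and sample coordinates are hypothetical. It enumerates all combinations, so it is only meant for tiny inputs and is not the Hybrid Co-location Miner.

```python
from itertools import combinations, product
from math import dist

def participation_index(instances, pattern, d):
    """Return (participation index, participation ratios, number of join rows).

    instances : dict mapping feature -> non-empty list of (x, y) points
    pattern   : list of features forming the candidate co-location
    d         : neighborhood distance threshold (assumed Euclidean)
    """
    # Candidate rows: one instance index per feature in the pattern
    index_ranges = [range(len(instances[f])) for f in pattern]
    rows = []
    for combo in product(*index_ranges):
        points = [instances[f][i] for f, i in zip(pattern, combo)]
        # Keep the row only if every pair of its points is within distance d
        if all(dist(p, q) <= d for p, q in combinations(points, 2)):
            rows.append(combo)
    # Participation ratio: fraction of a feature's instances that appear in some row
    ratios = {
        f: len({row[k] for row in rows}) / len(instances[f])
        for k, f in enumerate(pattern)
    }
    return min(ratios.values()), ratios, len(rows)

# Hypothetical sample: three instances of A, two instances of B
sample = {"A": [(0, 0), (5, 5), (10, 0)], "B": [(0, 1), (5, 6)]}
pi, ratios, support = participation_index(sample, ["A", "B"], d=1.5)
print(pi, ratios, support)
```

In this sample two of the three A instances have a B nearby and both B instances have an A nearby, so the ratios are 2/3 and 1 and the index is 2/3. The third return value is the size of the join result, which is what the Case 3 slide calls support.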
    • Conclusions & Future Directions
      • Spatial domains may not satisfy assumptions of classical methods
        • data: auto-correlation, continuous geographic space
        • patterns: global vs. local, e.g. spatial outliers vs. outliers
        • data exploration: maps and albums
      • Open Issues
        • patterns: hot-spots, blobology (shape), spatial trends, …
        • metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
        • spatio-temporal datasets
        • scale and resolution sensitivity of patterns
        • geo-statistical confidence measure for mined patterns
    • References
      • [1] S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, "Spatial Databases: Accomplishments and Research Needs", IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
      • [2] S. Shekhar and Y. Huang, "Discovering Spatial Co-location Patterns: a Summary of Results", Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
      • [3] S. Shekhar, Y. Huang, and H. Xiong, "Performance Evaluation of Co-location Miner", IEEE International Conference on Data Mining (ICDM'01), Nov. 2001. (Submitted)
      • [4] S. Shekhar, C.T. Lu, and P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications", Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
      • [5] S. Shekhar and S. Chawla, "Spatial Database: Concepts, Implementation and Trends" (book, to be published in 2001).
      • [6] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations", Proc. 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
      • [7] S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, "Modeling Spatial Dependencies for Mining Geospatial Data", First SIAM International Conference on Data Mining, 2001.
      • [8] S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, "Spatial Contextual Classification and Prediction Models for Mining Geospatial Data", IEEE Transactions on Multimedia, 2001. (Submitted)
      Some papers are available on the Web site: http://www.cs.umn.edu/research/shashi-group/