5/4/10

  1. 1. CSE 8392 SPRING 1999 DATA MINING: ADVANCED TOPICS Spatial Data Professor Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 (214) 768-3087 fax: (214) 768-3085 email: mhd@seas.smu.edu www: http://www.seas.smu.edu/~mhd March 1999
  2. 2. Spatial Data <ul><li>Data with a location component </li></ul><ul><li>Location may be represented in many ways </li></ul><ul><ul><li>Abstractly - (x,y) pair </li></ul></ul><ul><ul><li>Realistically - latitude,longitude pair </li></ul></ul><ul><li>Related technologies </li></ul><ul><ul><li>Geographic Information Systems (GIS) </li></ul></ul><ul><ul><li>Remote sensor technologies (satellites, infrared, etc.) </li></ul></ul><ul><ul><li>Medical imaging </li></ul></ul><ul><ul><li>Robotic navigation </li></ul></ul>
  3. 3. Spatial Data Issues <ul><li>Is location a data attribute? </li></ul><ul><li>Complex queries involving spatial operators </li></ul><ul><ul><li>Near </li></ul></ul><ul><ul><li>North, South, East, West, … </li></ul></ul><ul><ul><li>Adjacent </li></ul></ul><ul><ul><li>Contained in </li></ul></ul><ul><ul><li>Intersect - Spatial join </li></ul></ul><ul><li>How to store and index data? </li></ul><ul><ul><li>MBR (Minimum Bounding Rectangle) </li></ul></ul><ul><ul><li>R-tree - Hierarchy representing containment of MBRs </li></ul></ul>
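As a minimal sketch of the MBR idea on this slide, the following computes the bounding rectangle of a set of (x,y) points and tests two of the listed operators (intersect, contained in) on rectangles. The function names and the `(min_x, min_y, max_x, max_y)` tuple layout are illustrative assumptions, not something the slides specify.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
# Assumed layout for this sketch: (min_x, min_y, max_x, max_y)
MBR = Tuple[float, float, float, float]


def mbr_of(points: Iterable[Point]) -> MBR:
    """Smallest axis-aligned rectangle that encloses all points."""
    pts = list(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a: MBR, b: MBR) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr_contains(outer: MBR, inner: MBR) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])
```

An R-tree applies the containment test hierarchically: each internal node holds the MBR of everything beneath it, so a query can skip whole subtrees whose MBR does not intersect the query region.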
  4. 4. Spatial Data Mining Primitives (R[7]) <ul><li>Rules </li></ul><ul><ul><li>Spatial characteristic - Description of data (Price of houses in area) </li></ul></ul><ul><ul><li>Spatial discriminant - Comparison of data in different classes/clusters (Comparing prices of houses in different areas) </li></ul></ul><ul><ul><li>Spatial association - Implication of one set of data by others (Houses near schools are worth more) </li></ul></ul><ul><li>Thematic maps - Maps which show the distribution of a spatial feature </li></ul><ul><ul><li>Raster based - Pixel values used to show the feature values </li></ul></ul><ul><ul><li>Vector based - Spatial item shown by boundary and feature values </li></ul></ul>
  5. 5. Spatial Data Mining - Generalization (R[7]) <ul><li>Concept Hierarchy - Nesting relationship among data </li></ul><ul><ul><li>Move up - more general; Move down - more detail </li></ul></ul><ul><ul><li>Non-spatial or Spatial </li></ul></ul><ul><li>Spatial Data Dominant Generalization </li></ul><ul><ul><li>Generalize until threshold number of regions reached </li></ul></ul><ul><ul><li>Attribute-oriented induction - Move up hierarchy and summarize data values. Some attribute values may be deleted or tuples combined. </li></ul></ul><ul><ul><li>Fig 3 - Determine data value for each region </li></ul></ul><ul><li>Non-spatial Data Dominant Generalization </li></ul><ul><ul><li>Generalization based on grouping of data based on non-spatial features </li></ul></ul><ul><ul><li>Merge adjacent regions with same generalization values </li></ul></ul><ul><ul><li>Fig 4 - Generalize data values and group into regions </li></ul></ul>
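The "move up the hierarchy and combine tuples" step of attribute-oriented induction can be sketched roughly as follows. The concept hierarchy, the region names, and the `(region, label)` tuple shape are all hypothetical values chosen for illustration; the slides do not give a concrete hierarchy.

```python
from collections import Counter

# Hypothetical concept hierarchy for this sketch: child -> parent.
HIERARCHY = {
    "Dallas": "Texas",
    "Houston": "Texas",
    "Tulsa": "Oklahoma",
    "Texas": "USA",
    "Oklahoma": "USA",
}


def generalize(value, levels=1):
    """Move a value up the hierarchy by `levels` steps (stops at the root)."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)
    return value


def induce(tuples, levels=1):
    """Generalize the region of each (region, label) tuple, then combine
    tuples that became identical, keeping a count for each."""
    counts = Counter()
    for region, label in tuples:
        counts[(generalize(region, levels), label)] += 1
    return dict(counts)
```

Raising `levels` gives a more general summary with fewer distinct tuples, which is how a threshold on the number of regions could be reached.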
  6. 6. Spatial Data Mining - Clustering (R[7]) <ul><li>No prior knowledge needed </li></ul><ul><li>Suppose n objects and k clusters desired </li></ul><ul><li>PAM (Partitioning Around Medoids) - Represents each cluster by its medoid, the most centrally located point in the cluster. Determines the best medoids by examining all possible swaps of a medoid with a non-medoid. </li></ul><ul><li>CLARA (Clustering LARge Applications) - Samples database and determines medoids by applying PAM to the sample. Samples of size 40+2k are adequate. </li></ul>
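A rough sketch of the PAM swap search described on this slide, assuming 2-D points and Euclidean distance. For brevity it accepts the first improving swap rather than the best swap per pass, and it starts from the first k points; both are simplifications made for this sketch, not details taken from the slides.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Swap medoids with non-medoids until no swap lowers the total cost."""
    medoids = list(range(k))  # arbitrary start: indices of the first k points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Examine every (medoid slot, candidate point) pair.
        for i, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best - 1e-12:  # accept only strict improvements
                medoids, best, improved = trial, cost, True
    return sorted(medoids), best
```

Each pass costs on the order of k(n-k) cost evaluations, which is why CLARA runs PAM on a sample instead of the full database.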
  7. 7. CLARANS (R[7]) <ul><li>CLustering Applications based on RANdomized Search </li></ul><ul><li>Combines PAM and CLARA. Uses a different sample at each step in the algorithm. </li></ul><ul><li>A neighbor of a clustering is a clustering obtained by changing one medoid </li></ul><ul><li>Given a clustering and set of medoids, randomly choose one medoid </li></ul><ul><li>Try to randomly replace a clustering with a neighbor - checking up to maxneighbor neighbors </li></ul><ul><li>Replace with better neighbor and try again, else a local optimum is found </li></ul><ul><li>Repeat until numlocal locally optimal clusterings have been found </li></ul><ul><li>Shown experimentally to outperform PAM and CLARA </li></ul>
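The randomized neighbor search on this slide can be sketched as below, under the assumption of 2-D points, Euclidean distance, and more points than clusters. `maxneighbor` and `numlocal` are the parameters named on the slide; the default values, the `seed` argument, and the choice to return the best of the local optima are assumptions of this sketch.

```python
import math
import random


def cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def clarans(points, k, numlocal=3, maxneighbor=20, seed=0):
    """Randomized medoid search; assumes len(points) > k."""
    rng = random.Random(seed)
    n = len(points)
    best, best_cost = None, math.inf
    for _ in range(numlocal):
        current = rng.sample(range(n), k)  # random starting clustering
        current_cost = cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor differs from the current clustering in one medoid.
            i = rng.randrange(k)
            cand = rng.choice([j for j in range(n) if j not in current])
            neighbor = current[:i] + [cand] + current[i + 1:]
            c = cost(points, neighbor)
            if c < current_cost:
                current, current_cost, tried = neighbor, c, 0  # move, restart count
            else:
                tried += 1
        # No better neighbor in maxneighbor tries: treat as a local optimum.
        if current_cost < best_cost:
            best, best_cost = sorted(current), current_cost
    return best, best_cost
```

Unlike PAM, only a random subset of neighbors is examined at each step, so the result is a local optimum with respect to the neighbors that happened to be sampled.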
  8. 8. CLARANS for Spatial Data <ul><li>SD(CLARANS) - Spatial Dominant </li></ul><ul><ul><li>Cluster using CLARANS </li></ul></ul><ul><ul><li>Perform attribute-oriented induction on objects in each cluster resulting in non-spatial description of each cluster. </li></ul></ul><ul><li>NSD(CLARANS) - Non-spatial dominant </li></ul><ul><ul><li>Apply non-spatial generalizations yielding tuples </li></ul></ul><ul><ul><li>For each generalization, collect spatial components and cluster using CLARANS. </li></ul></ul><ul><ul><li>May combine overlapping clusters (from different applications of NSD algorithm to different data) </li></ul></ul><ul><li>Generalize CLARANS so that objects are stored on disk using a spatial data structure rather than in main memory (MM), as originally assumed. </li></ul>
  9. 9. Other Spatial Cluster Algorithms (R[7],R[2]) <ul><li>DBSCAN (Density-Based Clustering) </li></ul><ul><ul><li>Cluster has to have more than a minimum number of points, MinPts </li></ul></ul><ul><ul><li>Each point must be closer than Eps distance to at least one other point in cluster </li></ul></ul><ul><ul><li>Uses R*-tree </li></ul></ul><ul><li>BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) </li></ul><ul><ul><li>Clustering Feature (CF) - Represent cluster not by specific points but by a triple (p877,[2]). </li></ul></ul><ul><ul><li>CF Tree - Balanced tree with branching factor, B, and threshold, T, which is the largest diameter of clusters on leaf nodes. Internal nodes store sums of children's CF triples. </li></ul></ul>
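The slide does not spell out the CF triple. In the BIRCH paper it is (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms. The sketch below assumes that definition and shows why internal nodes can simply store sums of their children's triples: merging two clusters is component-wise addition. The class and method names are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CF:
    n: int                  # N: number of points summarized
    ls: Tuple[float, ...]   # LS: component-wise sum of the points
    ss: float               # SS: sum of squared norms of the points

    @staticmethod
    def of_point(p):
        return CF(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """CF of the union of two disjoint clusters: add the components."""
        ls = tuple(a + b for a, b in zip(self.ls, other.ls))
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def diameter(self):
        """Root of the average squared pairwise distance, computed from the
        triple alone (the quantity compared against the threshold T)."""
        if self.n < 2:
            return 0.0
        ls2 = sum(x * x for x in self.ls)
        d2 = (2 * self.n * self.ss - 2 * ls2) / (self.n * (self.n - 1))
        return math.sqrt(max(d2, 0.0))
```

Because the centroid and diameter are derived from the triple, a leaf can decide whether absorbing a new point would exceed T without revisiting the points it already summarizes.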
  10. 10. Association Rules for Spatial Terminology (R[7]) <ul><li>Spatial Association Rule - X->Y with confidence of c% </li></ul><ul><ul><li>Has spatial or non-spatial predicates </li></ul></ul><ul><ul><li>Ex ([7]): is_a(x,school) -> close_to(x,park) (80%) </li></ul></ul><ul><li>Spatial Predicates - Intersects, Overlaps, Disjoint, Left/Right of, Close to </li></ul><ul><ul><li>VERY expensive </li></ul></ul><ul><li>Two-Step Approach </li></ul><ul><ul><li>Avoid evaluating an expensive predicate on every object; instead, first find the objects that satisfy a more general (easy to find) predicate. Use this step as a filter. </li></ul></ul><ul><ul><li>Find large itemsets at this more generalized predicate </li></ul></ul><ul><ul><li>Apply the exact predicates to this filtered set to find large itemsets </li></ul></ul>
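The filter-then-refine part of the two-step approach can be sketched for the example rule is_a(x,school) -> close_to(x,park). This sketch assumes point locations, takes "close to" to mean Euclidean distance at most `d`, and uses a bounding-box check as the cheaper, more general predicate; all three are assumptions made for illustration, and the large-itemset counting is omitted.

```python
import math


def coarse_close(a, b, d):
    """Cheap, more general predicate: b lies in the box of half-width d
    around a. Every truly close pair passes this test."""
    return abs(a[0] - b[0]) <= d and abs(a[1] - b[1]) <= d


def exact_close(a, b, d):
    """Exact (more expensive) predicate: Euclidean distance at most d."""
    return math.dist(a, b) <= d


def confidence_close_to(schools, parks, d):
    """Fraction of schools with at least one park within distance d."""
    hits = 0
    for s in schools:
        # Step 1: filter with the coarse predicate.
        candidates = [p for p in parks if coarse_close(s, p, d)]
        # Step 2: apply the exact predicate only to the survivors.
        if any(exact_close(s, p, d) for p in candidates):
            hits += 1
    return hits / len(schools)
```

The filter is safe because the coarse predicate never rejects a pair that the exact predicate would accept; it only lets some false candidates through for the second step to discard.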
  11. 11. Spatial Data Mining - Approximation (R[8]) <ul><li>Identify characteristics of clusters </li></ul><ul><li>Characteristic determined by features (objects existing in spatial area) </li></ul><ul><li>Represent features and clusters by closed polygons (not just MBRs) </li></ul><ul><li>Aggregate Proximity Relationships - Find k features closest to cluster (p884) </li></ul><ul><ul><li>Distance measured by sum of distances to all points in cluster </li></ul></ul><ul><li>Aggregate Proximity Commonalities - Given n clusters, identify features nearest to most of the clusters (p885). May use features or grouping of features as identified in a concept hierarchy. </li></ul>
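As a simplified sketch of an aggregate proximity relationship, the following ranks features by the sum of their distances to all points in a cluster and returns the k closest. The slide represents features as closed polygons; treating each feature as a single point, and the dictionary of named features, are simplifying assumptions of this sketch.

```python
import heapq
import math


def aggregate_distance(feature, cluster):
    """Sum of distances from one feature location to every cluster point."""
    return sum(math.dist(feature, p) for p in cluster)


def k_nearest_features(features, cluster, k):
    """Names of the k features with the smallest aggregate distance.
    `features` maps a feature name to its (x, y) location."""
    return heapq.nsmallest(
        k, features, key=lambda name: aggregate_distance(features[name], cluster)
    )
```

For example, with the cluster `[(0, 0), (0, 2)]` and features at `(0, 1)`, `(1, 1)`, and `(5, 1)`, the first two features are the top-2 because their aggregate distances are the smallest.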
  12. 12. CRH Algorithm (R[8]) <ul><li>Aggregate Proximity Relationships </li></ul><ul><li>Balance between accuracy and efficiency </li></ul><ul><li>Uses multiple levels of filtering of possible features </li></ul><ul><ul><li>C = encompassing circle </li></ul></ul><ul><ul><li>R = isothetic (minimum bounding) rectangle </li></ul></ul><ul><ul><li>H = convex hull </li></ul></ul><ul><li>Rank features and keep highest ranking ones above threshold </li></ul><ul><ul><li>Highest - intersect; Lowest - disjoint </li></ul></ul><ul><ul><li>Linear, Bisection, Memoization </li></ul></ul><ul><li>Calculation of proximity performed on remaining features </li></ul>
  13. 13. GenCom Algorithm (R[8]) <ul><li>Aggregate Proximity Commonalities </li></ul><ul><li>Appearance Support Condition: Common Features </li></ul><ul><ul><li>S = {Fi | Fi in >= m of the n top-k lists} </li></ul></ul><ul><li>Uses concept hierarchies - to ensure adequate number of features </li></ul><ul><ul><li>Superseded, Deferred, Normalized Summation </li></ul></ul><ul><li>Algorithm, Fig. 5, p. 892 </li></ul><ul><li>Ex. p. 892 </li></ul>
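The appearance support condition on this slide, S = {Fi | Fi in >= m of the n top-k lists}, can be written directly as a counting step. The function name and the list-of-lists input are assumptions of this sketch; the concept-hierarchy handling of GenCom is not shown.

```python
from collections import Counter


def common_features(top_k_lists, m):
    """Features that appear in at least m of the given top-k lists.
    Each list is counted once per feature, even if it repeats a name."""
    counts = Counter(f for lst in top_k_lists for f in set(lst))
    return {f for f, c in counts.items() if c >= m}
```

With the three lists `["park", "school"]`, `["park", "mall"]`, and `["school", "park"]` and m = 2, the result is `{"park", "school"}`.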
  14. 14. Mining in Image and Raster Databases (R[7]) <ul><li>Venus images - Find volcanoes </li></ul><ul><ul><li>Data focusing - Identify portion of figure to look at </li></ul></ul><ul><ul><li>Feature extraction - Convert features into attributes to look at </li></ul></ul><ul><ul><li>Classification learning - Using Decision Trees </li></ul></ul><ul><li>POSS-II </li></ul><ul><ul><li>Second Palomar Observatory Sky Survey - Classify stellar objects </li></ul></ul><ul><ul><li>Decision trees on training data to generate rules </li></ul></ul><ul><li>CONQUEST (Raster) </li></ul><ul><ul><li>CONtent-based QUErying in Space and Time (Spatial/Temporal) </li></ul></ul><ul><ul><li>Special operators to access data and extract specific information. Strong interaction with domain expert </li></ul></ul><ul><ul><li>Use heuristic rules to obtain data. Based on signal processing. </li></ul></ul><ul><ul><li>Decomposes problem and solves pieces in parallel </li></ul></ul>
  15. 15. Spatial Data Mining: The Future <ul><li>Trends </li></ul><ul><ul><li>Alternative clustering </li></ul></ul><ul><ul><li>Use of temporal-spatial data </li></ul></ul><ul><ul><li>Parallel data mining </li></ul></ul><ul><ul><li>Integration with statistical analysis </li></ul></ul><ul><li>Issues </li></ul><ul><ul><li>Mining in OO databases </li></ul></ul><ul><ul><li>Spatial query language </li></ul></ul><ul><ul><li>Mining with spatial errors </li></ul></ul><ul><ul><li>Rule visualization </li></ul></ul>
