Lecture 9



  1. 1. Data Mining UMUC CSMN 667 Lecture #9
  2. 2. Lecture 9 “Spatial Data Mining” [Figure: regional prior probability map; class labels: Water, Coast, Land, Desert, Cloud, Snow/Ice, Glint]
  3. 3. Outline <ul><li>Spatial Mining: </li></ul><ul><ul><li>What is it? </li></ul></ul><ul><ul><li>How is it different? </li></ul></ul><ul><li>Spatial Queries </li></ul><ul><li>Indexing </li></ul><ul><li>Spatial Data Mining Primitives </li></ul><ul><li>Concept Hierarchies </li></ul><ul><li>Techniques for Spatial Mining </li></ul>
  4. 4. What is Spatial Mining? <ul><li>Spatial Mining = Mining Spatial Data Sets </li></ul><ul><li>Spatial data refer to any data about objects that occupy real physical space. </li></ul><ul><li>Attributes for spatial data usually will include spatial information. </li></ul><ul><li>Spatial information (metadata) is used to describe objects in space. </li></ul><ul><li>Spatial information includes geometric metadata (e.g., location, shape, size, distance, area, perimeter) and topological metadata (e.g., “neighbor of”, “adjacent to”, “included in”, “includes”). </li></ul>
  5. 5. Spatial Mining: how is it different? <ul><li>Spatial data types naturally carry linkages to neighboring data elements (e.g., contiguous geographic positions). </li></ul><ul><li>Spatial information corresponds to unique attributes and unique relationships between attributes that are not normally found in other databases. </li></ul><ul><li>In many cases, spatial data are stored as rasters: rows and columns, with (x,y) location information implicit in the placement within the raster (e.g., Remote Sensing images). </li></ul><ul><ul><li>(x,y) values are not stored anywhere in the database. </li></ul></ul><ul><ul><li>Special extraction tools are needed. </li></ul></ul><ul><li>These differences can pose challenges to standard data mining algorithms. </li></ul>
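To make the point about implicit (x,y) locations concrete, here is a minimal Python sketch of how a raster position might be turned into a geographic coordinate. It assumes a north-up raster with a known upper-left corner and square pixels of fixed size; the function name `pixel_to_lonlat` and all numbers are illustrative assumptions, not part of any particular extraction tool.

```python
def pixel_to_lonlat(row, col, origin_lon, origin_lat, pixel_size_deg):
    """Return the (lon, lat) of the center of pixel (row, col).

    Assumes a north-up raster: the column index grows eastward, the row
    index grows southward, and each pixel spans pixel_size_deg degrees.
    """
    lon = origin_lon + (col + 0.5) * pixel_size_deg
    lat = origin_lat - (row + 0.5) * pixel_size_deg
    return lon, lat


# Hypothetical raster: upper-left corner at 80W, 40N, with 0.5-degree pixels.
print(pixel_to_lonlat(0, 0, -80.0, 40.0, 0.5))
print(pixel_to_lonlat(3, 2, -80.0, 40.0, 0.5))
```

The location is recomputed from the pixel's placement each time, which is why nothing needs to be stored per pixel.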
  6. 6. Spatial Mining: more challenges <ul><li>Additional challenges … </li></ul><ul><li>How do you index these data collections? </li></ul><ul><ul><li>Perhaps you can use a spatial index (e.g., Latitude, Longitude) </li></ul></ul><ul><li>A raster [x,y] index almost never equals [Latitude, Longitude]! </li></ul><ul><li>How do you describe the spatial relationships among data items? What attributes should you use? </li></ul><ul><ul><li>These attributes are chosen and controlled by each organization. </li></ul></ul><ul><ul><li>Spatial data mining algorithms won’t know their meaning. </li></ul></ul><ul><li>Almost any data collected by, for, or about human society can be associated with a geo-location. Special tools called Geographic Information Systems (GIS) are (or can be) used; see http://www.gis.com/ </li></ul><ul><li>As a consequence, spatial data repositories are HUGE and growing. </li></ul>
  7. 7. Special Cases <ul><li>Image databases (Earth or the Sky) </li></ul><ul><li>Thematic maps (values of attributes or “themes” are displayed in a spatial distribution = a map!) </li></ul>
  8. 8. Spatial Queries <ul><li>Queries must be able to handle both spatial and non-spatial attributes </li></ul><ul><li>Queries may include range or region constraints indirectly (e.g., find all zip codes in the Rocky Mountains, or around Lake Michigan) </li></ul><ul><li>The Nearest Neighbor query is aptly named in this setting: it really is the NEAREST neighbor that you are seeking! </li></ul><ul><li>The distance metric is usually a real physical distance (Euclidean spatial distance) </li></ul>
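The two query types above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the records, their planar coordinates, and the `population` attribute are invented for the example, and both queries scan every record rather than using an index.

```python
import math

# Hypothetical records: (name, x, y, population) in arbitrary planar units.
places = [
    ("A", 1.0, 1.0, 500),
    ("B", 4.0, 5.0, 12000),
    ("C", 2.0, 2.5, 8000),
    ("D", 9.0, 9.0, 30000),
]


def range_query(records, xmin, ymin, xmax, ymax, min_population=0):
    """Names of records inside the rectangle that also pass a non-spatial filter."""
    return [name for name, x, y, pop in records
            if xmin <= x <= xmax and ymin <= y <= ymax and pop >= min_population]


def nearest_neighbor(records, qx, qy):
    """Name of the record with the smallest Euclidean distance to (qx, qy)."""
    return min(records, key=lambda r: math.hypot(r[1] - qx, r[2] - qy))[0]


print(range_query(places, 0, 0, 5, 5, min_population=1000))
print(nearest_neighbor(places, 3.0, 3.0))
```

The range query mixes a spatial constraint (the rectangle) with a non-spatial one (population), which is the combination the first bullet asks for.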
  9. 9. Region Queries <ul><li>Uses the concept of the MBR (Minimum Bounding Rectangle)… </li></ul><ul><li>MBRs can be used to index the spatial database: you need an index in order to find records quickly. </li></ul>
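As a small illustration of the MBR idea, the Python sketch below computes the minimum bounding rectangle of a set of vertices and tests whether two MBRs intersect. The tuple layout `(xmin, ymin, xmax, ymax)` and the function names are assumptions made for this example.

```python
def mbr(points):
    """Minimum bounding rectangle of (x, y) points as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """True if the two rectangles share at least one point (edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


triangle = [(2, 1), (6, 2), (3, 5)]
box = mbr(triangle)
print(box)
print(mbrs_intersect(box, (5, 4, 8, 8)))  # query region touching the MBR
print(mbrs_intersect(box, (7, 0, 9, 1)))  # query region well to the east
```

An MBR test is only a coarse filter: a query can intersect the rectangle without touching the object inside it, so a hit normally triggers an exact geometric check.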
  10. 10. Spatial Database Indexing <ul><li>Trees are frequently used to index spatial data. </li></ul><ul><li>Quad Tree: based upon assigning data to spatial quadrants </li></ul><ul><li>R-Tree: based on the range of values (Lat,Long) assigned to the set of MBRs. </li></ul><ul><li>k-D Tree: a binary search tree in k dimensions </li></ul><ul><li>… and many more </li></ul><ul><li>Searching a tree-based index is fast: find all intersections at the high level; ignore the rest! </li></ul>
  11. 11. Quad Tree <ul><li>Quadrant Tree </li></ul><ul><li>A Quad Tree is a hierarchical decomposition of the space into quadrants. </li></ul><ul><li>Each level in the tree represents the object as being equivalent to the set of quadrants that contain any portion of the object. </li></ul><ul><li>Each finer-grained level provides a more exact representation of the object. </li></ul><ul><li>The number of levels used is determined by the degree of accuracy desired. </li></ul>
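The decomposition described above can be sketched recursively. This Python example is a simplification under stated assumptions: the object is an axis-aligned rectangle, the space is the unit square, and quadrants are labeled by digit strings (0 = SW, 1 = SE, 2 = NW, 3 = NE), which is a labeling chosen for this sketch rather than the numbering used in the slides.

```python
def quadrants_covering(obj, cell=(0.0, 0.0, 1.0, 1.0), depth=2, path=""):
    """List the quadrant paths, `depth` levels down, that overlap rectangle `obj`.

    Rectangles are (xmin, ymin, xmax, ymax). Each character of a path names
    one quadrant choice: 0 = SW, 1 = SE, 2 = NW, 3 = NE.
    """
    xmin, ymin, xmax, ymax = cell
    # Skip cells that share no interior area with the object.
    if obj[0] >= xmax or obj[2] <= xmin or obj[1] >= ymax or obj[3] <= ymin:
        return []
    if depth == 0:
        return [path]
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    children = [
        (xmin, ymin, xmid, ymid),  # 0: SW
        (xmid, ymin, xmax, ymid),  # 1: SE
        (xmin, ymid, xmid, ymax),  # 2: NW
        (xmid, ymid, xmax, ymax),  # 3: NE
    ]
    result = []
    for label, child in enumerate(children):
        result += quadrants_covering(obj, child, depth - 1, path + str(label))
    return result


shape = (0.1, 0.1, 0.3, 0.6)
print(quadrants_covering(shape, depth=1))
print(quadrants_covering(shape, depth=2))
```

Going one level deeper replaces each coarse quadrant with only those sub-quadrants that still touch the object, which is the "more exact representation" the slide refers to.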
  12. 12. Quad Tree Example (2,3) (12,13,14) (49,52,53,54,…) “Indexing the Triangle”
  13. 13. R-Tree <ul><li>R-Tree (the “R” is usually read as Rectangle) uses ranges of values, stored as bounding rectangles, to build the index </li></ul><ul><li>As with Quad Tree, the region is divided into successively smaller rectangles (MBRs), containing the data items. </li></ul><ul><li>Rectangles need not be of the same size or quantity at each level of the tree. </li></ul><ul><li>Rectangles may actually overlap. </li></ul><ul><li>Each entry at the lowest (leaf) level refers to only one object. </li></ul><ul><li>Uses tree maintenance algorithms similar to those for B-trees (= the balanced multiway search trees traditionally used in non-spatial databases). </li></ul>
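The pruning behavior of a hierarchy of MBRs can be illustrated with a deliberately simplified Python sketch. This is not a full R-Tree: it has only two levels, is built once from a fixed set of objects, and has no insertion or node-splitting logic. The grouping rule (sort by `xmin`, then chunk) and all names are assumptions made for the example.

```python
def intersects(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) a and b overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bounding(rects):
    """Smallest rectangle enclosing every rectangle in rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def build_two_level(objects, group_size=2):
    """Group object MBRs into nodes; each node stores the MBR of its group."""
    items = sorted(objects.items(), key=lambda kv: kv[1][0])
    nodes = []
    for i in range(0, len(items), group_size):
        group = items[i:i + group_size]
        nodes.append((bounding([r for _, r in group]), group))
    return nodes


def search(nodes, query):
    """Return names of objects whose MBR intersects the query rectangle."""
    hits = []
    for node_mbr, entries in nodes:
        if not intersects(node_mbr, query):
            continue  # the whole group is skipped without looking inside it
        hits += [name for name, r in entries if intersects(r, query)]
    return hits


objects = {"a": (0, 0, 1, 1), "b": (1, 2, 2, 3), "c": (6, 6, 7, 7), "d": (8, 5, 9, 6)}
index = build_two_level(objects)
print(search(index, (5.5, 5.5, 6.5, 6.5)))
```

In this run the node holding `a` and `b` is rejected by a single rectangle test, so neither object is examined individually.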
  14. 14. R-Tree Example
  15. 15. k-D Tree <ul><li>k-Dimensional index (for multi-dimensional data … dimensions can be 2, 3, or many more) </li></ul><ul><li>Designed for multi-attribute data, not necessarily spatial </li></ul><ul><li>Uses the “Divide and Conquer” approach </li></ul><ul><li>This is a variation of the binary search tree </li></ul><ul><li>Each level is used to index one of the dimensions of the spatial object </li></ul><ul><li>Lowest level cell has only one object </li></ul><ul><li>Divisions are not based on MBRs; the space is split successively along one dimension at a time (cycling through the dimensions or, in some variants, choosing the longest dimension). </li></ul>
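A compact Python sketch of a k-D Tree follows. It assumes the common convention of cycling through the dimensions level by level and splitting at the median point; the slides also mention splitting the longest dimension, which this sketch does not do. The dictionary-based node layout and function names are illustrative choices.

```python
import math


def build_kdtree(points, depth=0):
    """Build a k-D Tree by splitting at the median of one dimension per level."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(tree["point"])
print(nearest(tree, (9, 2)))
```

The check against the splitting plane is what lets the search skip whole subtrees, in the same spirit as the MBR pruning used by the other tree indices.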
  16. 16. k-D Tree Example
  17. 17. Example of Tree-based Indexing: Quad Tree Indexing <ul><li>A Quadrant Tree can be used to index data records in a spatial database based upon each record’s spatial location </li></ul><ul><li>A business or government agency may wish to query the database using spatial constraints. </li></ul><ul><li>The Quad Tree (QT) facilitates the storage and querying of spatial data. </li></ul><ul><li>For example: </li></ul><ul><ul><li>The complex geographical boundaries of different states, counties, and cities may be difficult to specify in a database. </li></ul></ul><ul><ul><li>However, each address (home or business) in the database may be indexed with a QT value derived from its spatial location. </li></ul></ul><ul><ul><li>Then, in order to search for all addresses within some particular geographic boundary, one can simply query the database for records that have particular values of the quad tree index. </li></ul></ul>
  18. 18. Quad Tree-based Spatial Mining <ul><li>The Quadrant Tree (QT) index can be applied to data records in multiple spatial databases. </li></ul><ul><li>One can look for associations, or patterns, or clusters, or outliers, or nearest neighbors, etc. using the QT index values across the different databases. </li></ul><ul><li>For example: </li></ul><ul><ul><li>One can apply association mining to the spatial database(s) and find associations among different attributes for database records that correspond to the same location or same type of location (e.g., urban, rural, suburban,…). </li></ul></ul><ul><ul><li>One can apply a nearest neighbor search using the QT index as the query qualifier. </li></ul></ul><ul><ul><li>One can use spatial attributes in some of the nodes (question points) in a Decision Tree. </li></ul></ul>
  19. 19. Quad Tree Example: this grid can be overlaid onto any geographic map. [Figure: Quad Tree grid with cells numbered 1 to 84 across three levels of resolution; a pink circle marks the sales area; three addresses are labeled a, b, and c.] The pink circle represents the sales area for our local retail business. Our company’s sales region may be described with the following sets of Quad Tree indices: {1,2} or {6,7,9,12} or {26,27,30,37,39,40,49,50} or with even higher-order indices (corresponding to finer grids). Therefore, if we convert the local phonebook into a high-order Quad-Tree (QT) indexed listing, then all residents whose addresses occur with any of these QT index values {26,27,30,37,39,40,49,50} are potential customers -- thus, we will send them some of our company’s advertising mail. Address(a)={24} > No Mail; Address(b)={30} > Send Mail; Address(c)={75} > No Mail.
  20. 20. Quad Tree example, continued <ul><li>In the preceding slide, it was easy to see whether address (a), (b), or (c) fell within the pink boundary (i.e., within the sales region of our company). </li></ul><ul><li>But, in general, you do not have such a pretty map and picture to look at. </li></ul><ul><li>In general, you only have the data in the database, and it might have thousands or millions of records. </li></ul><ul><li>So, in general, the Quad Tree index provides a very rapid means to discover whether or not a database record is within a selected geographic region. Just check the numbers and you are done! </li></ul><ul><li>The biggest decision is how many levels to use for the indices (i.e., how fine-grained you want to make the hierarchical grid in the Quad Tree). It is possible to go to very high levels of spatial resolution, requiring very large index numbers. But it is still easy to query even a huge database for the selected values. </li></ul>
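The "just check the numbers" step can be sketched in Python as follows. The sketch assumes addresses have already been mapped to coordinates in the unit square, and it labels cells with digit strings (0 = SW, 1 = SE, 2 = NW, 3 = NE per level) instead of the 1 to 84 numbering of the slide's figure, which cannot be recovered here. The addresses and the sales region are invented for the example.

```python
def qt_index(x, y, levels):
    """Quadrant path of point (x, y) in the unit square, one digit per level.

    Digits: 0 = SW, 1 = SE, 2 = NW, 3 = NE (a labeling assumed for this sketch).
    """
    xmin, ymin, size = 0.0, 0.0, 1.0
    digits = []
    for _ in range(levels):
        size /= 2
        east = x >= xmin + size
        north = y >= ymin + size
        digits.append(str(int(east) + 2 * int(north)))
        if east:
            xmin += size
        if north:
            ymin += size
    return "".join(digits)


# Hypothetical address coordinates and a sales region made of four level-2 cells
# surrounding the center of the map.
addresses = {"a": (0.10, 0.90), "b": (0.40, 0.40), "c": (0.90, 0.20)}
sales_region = {"03", "12", "21", "30"}

indexed = {name: qt_index(x, y, levels=2) for name, (x, y) in addresses.items()}
mail_list = [name for name, idx in indexed.items() if idx in sales_region]
print(indexed)
print(mail_list)
```

Once each record carries its index value, membership in the region is a set lookup, with no geometry evaluated at query time.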
  21. 21. Other Trees for Database Indexing <ul><li>What we have said in the preceding slides about Quad Trees is also applicable to R-Trees, k-D Trees, and other tree-based indices used in spatial databases. </li></ul><ul><li>There are many other types of trees used to index more general databases, including the B-tree, hB-tree, R-tree, R+-tree, R*-tree, R**-tree, packed R-tree, M-tree, SR-tree, SS-tree, RD-tree, BANG file, BV-tree, Buddy tree, Cell tree, G-tree, GBD-tree, Gridfile, KDB-tree, LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid tree, Q-tree, SKD-tree, TV-tree, UB-tree, Z-order index, etc. </li></ul>
  22. 22. Spatial Data Mining Primitives <ul><li>Orientation relationships: </li></ul><ul><ul><li>North, South, East, West </li></ul></ul><ul><li>Topological relationships: </li></ul><ul><ul><li>see next slide </li></ul></ul><ul><li>Distance measures </li></ul>
  23. 23. Topological Relationships <ul><li>Disjoint </li></ul><ul><li>Overlaps or Intersects </li></ul><ul><li>Equals </li></ul><ul><li>Covered by or inside or contained in </li></ul><ul><li>Covers or contains </li></ul>
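For axis-aligned rectangles, the relationships listed above can be decided with a handful of comparisons. The Python sketch below is one possible classification; it treats rectangles that merely touch at an edge as overlapping, which is an assumption of this example, and it uses the tuple layout `(xmin, ymin, xmax, ymax)`.

```python
def topological_relation(a, b):
    """Relationship of rectangle a to rectangle b, each (xmin, ymin, xmax, ymax)."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"      # a is covered by b
    if b[0] >= a[0] and b[1] >= a[1] and b[2] <= a[2] and b[3] <= a[3]:
        return "contains"    # a covers b
    return "overlaps"


print(topological_relation((0, 0, 1, 1), (2, 2, 3, 3)))
print(topological_relation((1, 1, 2, 2), (0, 0, 3, 3)))
print(topological_relation((0, 0, 3, 3), (1, 1, 2, 2)))
print(topological_relation((0, 0, 2, 2), (1, 1, 3, 3)))
```

Real spatial objects are rarely rectangles, so in practice a test like this would be applied to MBRs first and followed by an exact geometric check.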
  24. 24. Distance Between Objects <ul><li>Much of this is material we covered previously. </li></ul><ul><li>Can use cluster distances (Single Link, etc. - see page 130) </li></ul><ul><li>Euclidean, Manhattan, etc. </li></ul><ul><li>Some special spatial extensions are covered on the next slide. </li></ul>
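The familiar measures named on this slide can be written compactly in Python. The sketch below shows Euclidean and Manhattan distance between points and the Single Link distance between two clusters (the smallest distance over all cross-cluster pairs); the function names and sample points are illustrative.

```python
def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def single_link(cluster_a, cluster_b, dist=euclidean):
    """Smallest distance between any point of one cluster and any point of the other."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(single_link([(0, 0), (1, 0)], [(4, 4), (4, 0)]))
```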
  25. 25. Aggregate Proximity <ul><li>Aggregate Proximity – a measure of how close a cluster is to a feature. </li></ul><ul><li>Aggregate proximity relationship finds the k closest features to a cluster. </li></ul><ul><li>CRH Algorithm – uses different shapes: </li></ul><ul><ul><li>Encompassing Circle (C) </li></ul></ul><ul><ul><li>Isothetic Rectangle (R) = rectangle with edges parallel to the principal axes </li></ul></ul><ul><ul><li>Convex Hull (H) </li></ul></ul>
  26. 26. CRH example: simple mathematical formulae exist to calculate the distance between any two circles, rectangles, or convex hulls
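Two of these formulae are short enough to sketch in Python: the gap between two circles and the gap between two isothetic rectangles. The convex hull case needs a polygon-to-polygon distance routine and is omitted here. The representations (`(x, y, r)` for a circle, `(xmin, ymin, xmax, ymax)` for a rectangle) are assumptions of this example, and both functions return 0 when the shapes touch or overlap.

```python
import math


def circle_distance(c1, c2):
    """Gap between two circles given as (x, y, r); 0 if they touch or overlap."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return max(0.0, math.hypot(x2 - x1, y2 - y1) - r1 - r2)


def rectangle_distance(a, b):
    """Gap between two axis-parallel rectangles (xmin, ymin, xmax, ymax)."""
    dx = max(0.0, b[0] - a[2], a[0] - b[2])  # horizontal separation, if any
    dy = max(0.0, b[1] - a[3], a[1] - b[3])  # vertical separation, if any
    return math.hypot(dx, dy)


print(circle_distance((0, 0, 1), (5, 0, 2)))
print(rectangle_distance((0, 0, 1, 1), (4, 5, 6, 7)))
print(rectangle_distance((0, 0, 2, 2), (1, 1, 3, 3)))
```

These coarse shapes are cheap to compare, which is what makes them useful as a first pass when ranking the features closest to a cluster.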
  27. 27. Concept Hierarchies <ul><li>Specialization (Progressive Refinement) = move down the hierarchy </li></ul><ul><li>Generalization = move up the hierarchy </li></ul><ul><li>Similar to “roll-up” and “drill-down” </li></ul><ul><li>An implementation: STING </li></ul>
  28. 28. Progressive Refinement <ul><li>Make approximate answers prior to more accurate ones. </li></ul><ul><li>Filter out data that are not part of answer. </li></ul><ul><li>Hierarchical view of data based on spatial relationships </li></ul><ul><li>Coarse predicate recursively refined </li></ul>
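One common way to realize this idea is a filter-and-refine query: a cheap bounding-box test discards most candidates, and an exact test runs only on the survivors. The Python sketch below assumes the region is a simple polygon and uses a standard ray-casting point-in-polygon test for the exact step; the polygon, points, and function names are invented for the example.

```python
def in_mbr(pt, poly):
    """Coarse test: is the point inside the polygon's bounding rectangle?"""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs) <= pt[0] <= max(xs) and min(ys) <= pt[1] <= max(ys)


def in_polygon(pt, poly):
    """Exact test by ray casting: count edge crossings to the right of the point."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge spans the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_and_refine(points, poly):
    """Return (candidates after the coarse filter, exact answers)."""
    candidates = [p for p in points if in_mbr(p, poly)]
    answers = [p for p in candidates if in_polygon(p, poly)]
    return candidates, answers


region = [(0, 0), (4, 0), (0, 4)]          # a right triangle
points = [(1, 1), (3, 3), (5, 5)]
print(filter_and_refine(points, region))
```

Here (5, 5) is removed by the coarse filter, (3, 3) passes the filter but fails the exact test, and only (1, 1) is in the final answer.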
  29. 29. Example
  30. 30. STING <ul><li>STatistical INformation Grid-based </li></ul><ul><li>Hierarchical technique to divide area into rectangular cells </li></ul><ul><li>Grid data structure contains summary information about each cell </li></ul><ul><li>Hierarchical clustering </li></ul><ul><li>Similar to Quad Tree </li></ul>
  31. 31. Nodes in STING data structure:
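The figure for this slide is not reproduced in this transcript. As a rough Python sketch of the idea, assume each cell stores a count, mean, minimum, and maximum for one numeric attribute, and that a parent cell's summary is computed from its children's summaries without going back to the raw data. The field names and sample values are assumptions made for this example, and every cell is assumed to be non-empty.

```python
def leaf_stats(values):
    """Summary for a bottom-level cell, computed directly from its data values."""
    return {"n": len(values), "mean": sum(values) / len(values),
            "min": min(values), "max": max(values)}


def merge_stats(children):
    """Summary for a parent cell, derived only from its children's summaries."""
    n = sum(c["n"] for c in children)
    return {
        "n": n,
        "mean": sum(c["n"] * c["mean"] for c in children) / n,  # count-weighted mean
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }


# Four hypothetical child cells of one parent cell.
cells = [leaf_stats([1, 3]), leaf_stats([10]), leaf_stats([4, 4, 4]), leaf_stats([6, 8])]
parent = merge_stats(cells)
print(parent)
```

Because summaries combine upward this way, a query can often be answered from a coarse level and refined downward only where the summaries look relevant.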
  32. 32. Spatial Data Mining Algorithms <ul><li>Most traditional methods still apply, with special “features” to deal with spatial information (geographic / topological metadata): </li></ul><ul><ul><li>Association Rules </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Decision Trees </li></ul></ul><ul><ul><li>Neural Nets </li></ul></ul><ul><ul><li>Bayes Networks </li></ul></ul><ul><li>Refer to text and … </li></ul><ul><li>If you are interested in this subject, try a Google search on “Spatial Data Mining”. </li></ul>
  33. 33. Spatial Rules <ul><li>Characteristic Rule : </li></ul><ul><li>The average family income in Dallas is $50,000. </li></ul><ul><li>Discriminant Rule : </li></ul><ul><li>The average family income in Dallas is $50,000, while in Plano the average income is $75,000. </li></ul><ul><li>Association Rule : </li></ul><ul><li>The average family income in Dallas for families living near White Rock Lake is $100,000. </li></ul>
  34. 34. Spatial Association Rules <ul><li>Either antecedent or consequent must contain spatial predicates. </li></ul><ul><li>Views the underlying database as a set of spatial objects. </li></ul><ul><li>May generate these rules using a type of progressive refinement </li></ul>
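To illustrate a rule with a spatial predicate in the antecedent, the Python sketch below evaluates the support and confidence of "near the lake implies high income" over a handful of invented records. The coordinates, incomes, distance threshold, and income cutoff are all hypothetical values chosen for the example.

```python
import math

# Hypothetical households: ((x, y) location, family income).
households = [((1, 1), 120000), ((2, 0), 95000), ((0, 2), 40000),
              ((8, 8), 50000), ((9, 7), 110000)]
LAKE = (0, 0)  # hypothetical lake location


def near_lake(record, max_dist=3.0):
    """Spatial predicate: the household lies within max_dist of the lake."""
    return math.dist(record[0], LAKE) <= max_dist


def high_income(record, cutoff=90000):
    """Non-spatial predicate on the income attribute."""
    return record[1] >= cutoff


def support_confidence(records, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    return n_both / len(records), n_both / n_ante


support, confidence = support_confidence(households, near_lake, high_income)
print(support, confidence)
```

The rule counts as spatial only because `near_lake` is computed from locations; the support and confidence arithmetic is the same as for ordinary association rules.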
  35. 35. Spatial Classification <ul><li>Partitions the spatial objects into categories </li></ul><ul><li>May use nonspatial attributes and/or spatial attributes </li></ul><ul><li>Generalization and progressive refinement may be used. </li></ul>
  36. 36. Spatial Decision Tree <ul><li>Approach similar to that used for spatial association rules. </li></ul><ul><li>A spatial object can be described based on the objects closest to it; this neighborhood is called its Buffer. </li></ul><ul><li>Description of class based upon aggregation of nearby objects. </li></ul>
  37. 37. Spatial Clustering <ul><li>Detect clusters of irregular shapes. </li></ul><ul><li>Use of centroids and simple distance approaches may not work well. </li></ul><ul><li>Clusters should be independent of order of input. </li></ul>
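One family of methods that meets these requirements is density-based clustering. The Python sketch below follows the general DBSCAN scheme, which the slides do not name, so treat it as one possible technique rather than the method intended here. The parameters `eps` and `min_pts` and the sample points are assumptions, and neighbor lookup is a brute-force scan for brevity.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = {}

    def neighbors(i):
        # Brute-force neighborhood; the point itself is included.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later join as a border point
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if j in labels:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only dense points keep growing the cluster
                queue.extend(nbrs)
        cluster += 1
    return [labels[i] for i in range(len(points))]


# An L-shaped chain, a small blob, and one isolated point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2),
       (10, 10), (10, 11), (11, 10), (20, 0)]
print(dbscan(pts, eps=1.1, min_pts=2))
```

The L-shaped chain ends up in a single cluster even though its centroid lies off the chain, which is the kind of irregular shape a centroid-based method tends to split or distort.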
  38. 38. Summary
  39. 39. Summary of Topics Covered - Lecture 9 <ul><li>Spatial Mining: </li></ul><ul><ul><li>What is it? </li></ul></ul><ul><ul><li>How is it different? </li></ul></ul><ul><li>Spatial Queries </li></ul><ul><li>Indexing </li></ul><ul><li>Spatial Data Mining Primitives </li></ul><ul><li>Concept Hierarchies </li></ul><ul><li>Techniques for Spatial Mining </li></ul>