sdm_han.ppt

  1. 1. Data Mining in Spatial Databases: A Multi-Disciplinary Promise Jiawei Han Database Systems Research Lab. Department of Computing Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/~hanj
  2. 2. Outline <ul><li>Why geo-spatial data mining? </li></ul><ul><li>Spatial data mining : major progress </li></ul><ul><ul><li>Spatial OLAP </li></ul></ul><ul><ul><li>Spatial association </li></ul></ul><ul><ul><li>Spatial classification </li></ul></ul><ul><ul><li>Spatial clustering and outlier analysis </li></ul></ul><ul><li>Research challenges in spatial data mining </li></ul>
  3. 3. Why Geo-Spatial Data Mining? <ul><li>Spatial data mining </li></ul><ul><ul><li>Mining interesting knowledge/patterns from huge amounts of spatial data </li></ul></ul><ul><li>Necessity is the mother of invention </li></ul><ul><ul><li>Data explosion problem: data is overwhelming and everywhere (automated data collection, satellite images, remote sensing, GPS, mobile computing and network technology, WWW, etc.) </li></ul></ul><ul><ul><li>Putting data to use: data mining may lead to important discoveries </li></ul></ul>
  4. 4. Spatial Data Mining vs. Traditional Spatial Data Analysis <ul><li>Scalability and performance </li></ul><ul><ul><li>Handle gigabytes of data, interactive exploration, multi-dimensional drilling/rolling, visualization, ... </li></ul></ul><ul><li>Tight integration of database systems and GIS </li></ul><ul><ul><li>Most spatial and non-spatial data are stored in relational database systems (e.g., Oracle, MS SQL Server, DB2, Informix), GIS (e.g., ArcInfo, MapInfo), or data warehouses </li></ul></ul><ul><ul><li>Tight coupling and seamless integration </li></ul></ul><ul><ul><li>Data cleaning, data integration, and data consolidation </li></ul></ul><ul><li>New methods and functionalities </li></ul><ul><ul><li>Association, sequential patterns, classification methods, ... </li></ul></ul>
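  5. 5. Spatial Data Mining: Confluence of Multiple Disciplines <ul><li>Spatial data mining draws on: </li></ul><ul><ul><li>Spatial DB System </li></ul></ul><ul><ul><li>Statistics </li></ul></ul><ul><ul><li>Mobile Computing </li></ul></ul><ul><ul><li>Geography </li></ul></ul><ul><ul><li>Machine Learning (AI) </li></ul></ul><ul><ul><li>Visualization </li></ul></ul><ul><ul><li>Remote Sensing </li></ul></ul>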
  6. 6. Outline <ul><li>Why geo-spatial data mining? </li></ul><ul><li>Spatial data mining : major progress </li></ul><ul><ul><li>Spatial OLAP </li></ul></ul><ul><ul><li>Spatial association </li></ul></ul><ul><ul><li>Spatial classification </li></ul></ul><ul><ul><li>Spatial clustering and outlier analysis </li></ul></ul><ul><li>Research challenges in spatial data mining </li></ul>
  7. 7. Spatial Data Mining: Major Progress <ul><li>Geo-spatial data warehouse and spatial OLAP </li></ul><ul><li>Spatial data classification/predictive modeling </li></ul><ul><li>Spatial clustering/segmentation </li></ul><ul><li>Spatial association and correlation analysis </li></ul><ul><li>Spatial regression analysis </li></ul><ul><li>Spatio-temporal pattern analysis </li></ul><ul><li>Many more to be explored </li></ul>
  8. 8. Spatial Data Warehousing <ul><li>Spatial data warehouse </li></ul><ul><ul><li>Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis </li></ul></ul><ul><li>Spatial data integration: a big issue </li></ul><ul><ul><li>Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.) </li></ul></ul><ul><ul><li>Vendor-specific formats (ESRI, MapInfo, Intergraph, etc.) </li></ul></ul><ul><li>Spatial data cube: multidimensional spatial database </li></ul><ul><ul><li>Both dimensions and measures may contain spatial components </li></ul></ul>
  9. 9. Star Schema of the BC Weather Warehouse <ul><li>Spatial data warehouse </li></ul><ul><ul><li>Dimensions </li></ul></ul><ul><ul><ul><li>region_name </li></ul></ul></ul><ul><ul><ul><li>time </li></ul></ul></ul><ul><ul><ul><li>temperature </li></ul></ul></ul><ul><ul><ul><li>precipitation </li></ul></ul></ul><ul><ul><li>Measures </li></ul></ul><ul><ul><ul><li>region_map </li></ul></ul></ul><ul><ul><ul><li>area </li></ul></ul></ul><ul><ul><ul><li>count </li></ul></ul></ul>[Figure: star schema with a fact table linked to dimension tables]
  10. 10. Spatial OLAP — OLAP on Map Data
  11. 11. Dynamic Merging of Spatial Objects? <ul><li>Materializing (precomputing) all? — too much storage space </li></ul><ul><li>On-line merge? — slow, expensive! </li></ul><ul><li>A better way: object-based, selective (partial) materialization </li></ul>
  12. 12. Spatial Association and Correlation Mining <ul><li>Question: what kinds of objects are usually located close to golf courses? </li></ul><ul><li>Query: FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course" FROM Washington_Golf_courses, Washington WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km") AND Washington.CFCC <> "D81" IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC SET SUPPORT THRESHOLD 0.5 </li></ul>
  13. 13. Efficient Mining of Spatial Associations <ul><li>Progressive refinement </li></ul><ul><ul><li>Hierarchy of spatial relationships: </li></ul></ul><ul><ul><ul><li>g_close_to: near_by, touch, intersect, contain, etc. </li></ul></ul></ul><ul><ul><ul><li>First search for a rough relationship and then refine it </li></ul></ul></ul><ul><ul><li>Rough spatial computation (as a filter) </li></ul></ul><ul><ul><ul><li>Using MBR or R-tree for rough estimation </li></ul></ul></ul><ul><ul><li>Detailed spatial algorithm (as refinement) </li></ul></ul><ul><ul><ul><li>Apply only to those objects that have passed the rough spatial association test (support no less than min_support) </li></ul></ul></ul><ul><li>Micro-clustering and join indexing methods </li></ul>
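The filter-and-refine idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm from the talk: objects are hypothetical point sets, the names `mbr`, `mbr_distance`, `exact_distance`, and `close_pairs` are invented here, and the support-counting part of association mining is left out. The MBR distance is a lower bound on the true distance, so the cheap filter never discards a pair that the exact test would accept.

```python
from math import hypot

def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a point list."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def mbr_distance(a, b):
    """Lower bound on the distance between two objects, from their MBRs."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)

def exact_distance(p, q):
    """Exact minimum distance between two point sets (the costly step)."""
    return min(hypot(x1 - x2, y1 - y2) for x1, y1 in p for x2, y2 in q)

def close_pairs(objects, threshold):
    """Return pairs of object names within `threshold`: filter, then refine."""
    boxes = {name: mbr(pts) for name, pts in objects.items()}
    names = sorted(objects)
    candidates = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if mbr_distance(boxes[a], boxes[b]) <= threshold  # rough filter
    ]
    return [
        (a, b)
        for a, b in candidates
        if exact_distance(objects[a], objects[b]) <= threshold  # refinement
    ]

# Hypothetical toy objects, coordinates in km
objects = {
    "golf_course": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "lake": [(3, 3), (5, 3), (5, 5), (3, 5)],
    "mall": [(20, 20), (22, 20), (22, 22)],
}
print(close_pairs(objects, threshold=1.5))
```

In a real system the rough step would typically be served by a spatial index such as an R-tree rather than a pairwise loop.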
  14. 14. Spatial Classification and Model Construction <ul><li>Generalization- or clustering-based induction </li></ul><ul><li>Interactive classification </li></ul>
  15. 15. Can Typical Classification Methods Be Applied to Spatial Classification? <ul><li>Decision-tree classification: </li></ul><ul><ul><li>Entropy-based information-gain vs. Gini-index vs. MDL </li></ul></ul><ul><ul><li>Tree pruning methods; ensemble methods: boosting/bagging </li></ul></ul><ul><li>Naïve-Bayesian classifier + boosting </li></ul><ul><li>Bayesian belief networks </li></ul><ul><li>Neural networks </li></ul><ul><li>Genetic programming </li></ul><ul><li>Nearest neighbor and case-based reasoning </li></ul><ul><li>Support vector machine method </li></ul><ul><li>Association-based multi-dimensional classification </li></ul>
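To make the "information-gain vs. Gini-index" item concrete, the sketch below scores one candidate split with both impurity measures. The data are hypothetical (houses labeled H or L, split by an invented spatial predicate "near the lake"), and the function names are assumptions made for this illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(labels, groups, impurity):
    """Impurity reduction obtained by partitioning `labels` into `groups`."""
    n = len(labels)
    weighted = sum(len(g) / n * impurity(g) for g in groups)
    return impurity(labels) - weighted

# Hypothetical toy data: class labels on each side of the split
near_lake = ["H", "H", "H", "L"]
not_near_lake = ["L", "L", "L", "H"]
all_houses = near_lake + not_near_lake

print(split_gain(all_houses, [near_lake, not_near_lake], entropy))
print(split_gain(all_houses, [near_lake, not_near_lake], gini))
```

Both measures reward the same split here; they differ mainly in how strongly they penalize mixed partitions.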
  16. 16. What Kind of Houses Are Highly Valued? (Associative Classification) [Figure: map of houses labeled H or L, grouped into micro-clusters C01 to C10, with a highway and a lake]
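  17. 17. Grouping and Associating Spatial Features for Classification
      <table>
      <tr><th>House_ID</th><th>MCluster_ID</th><th>Spatial Features</th><th>Yrs</th><th>Sqr_ft</th><th>Class</th></tr>
      <tr><td>H01</td><td>C05</td><td>close_to(como lake), next_to(Futureshop), ...</td><td>16</td><td>2300</td><td>H</td></tr>
      <tr><td>H03</td><td>C08</td><td>close_to(Lougheed_Hwy), next_to(Austin_elmntary), ...</td><td>32</td><td>2500</td><td>L</td></tr>
      <tr><td>H45</td><td>C09</td><td>next_to(QueenEliz_park), next_to(Cambie_road), ...</td><td>20</td><td>3100</td><td>H</td></tr>
      <tr><td>H82</td><td>C05</td><td>close_to(como lake), next_to(Futureshop), ...</td><td>18</td><td>3400</td><td>H</td></tr>
      <tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
      <tr><td>H1857</td><td>c18</td><td>inside(east_Vancouver), close_to(Fraser_st), close_to(sky_train_station)</td><td>41</td><td>2100</td><td>L</td></tr>
      </table>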
  18. 18. Spatial Classification: Typical Examples <ul><li>Mining volcanoes on Venus </li></ul><ul><ul><li>Training set provided by experts </li></ul></ul><ul><ul><li>The constructed model can be used for prediction </li></ul></ul><ul><li>Finding stars in galaxies (JPL’96) </li></ul><ul><li>QuakeFinder </li></ul><ul><ul><li>Find earthquakes related to spatial information </li></ul></ul>
  19. 19. Spatial Trend Analysis <ul><li>Function </li></ul><ul><ul><li>Detect changes and trends along a spatial dimension </li></ul></ul><ul><ul><li>Study the trend of non-spatial or spatial data changing with space </li></ul></ul><ul><li>Application examples </li></ul><ul><ul><li>Observe how climate or vegetation changes with increasing distance from an ocean </li></ul></ul><ul><ul><li>Observe how crime rate or unemployment rate changes with geographic location within a city </li></ul></ul>
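One simple way to quantify a trend along a spatial dimension is an ordinary least-squares line of an attribute against distance. The sketch below assumes that approach; the slide does not prescribe a method, and the precipitation figures and the name `linear_trend` are invented for illustration.

```python
def linear_trend(distances, values):
    """Least-squares slope and intercept of `values` against `distances`."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_v = sum(values) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (v - mean_v) for d, v in zip(distances, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_d
    return slope, intercept

# Hypothetical data: distance from the ocean (km) vs. annual precipitation (mm)
distance_km = [0, 10, 20, 30, 40]
precip_mm = [1200, 1100, 1000, 900, 800]

slope, intercept = linear_trend(distance_km, precip_mm)
print(f"precipitation changes by {slope:.1f} mm per km inland")
```

A negative slope indicates that the attribute decreases with distance from the reference object.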
  20. 20. Spatial Cluster Analysis <ul><li>Mining clusters: k-means, k-medoids, hierarchical, density-based, etc. </li></ul><ul><li>Analysis of distinct features of the clusters </li></ul>
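As a minimal sketch of the first method in the list, the code below runs plain k-means on 2-D points. It assumes caller-supplied initial centers and a fixed number of iterations to keep the example deterministic; the points and the name `k_means` are hypothetical.

```python
from math import dist

def k_means(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty)
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical toy data: two well-separated groups of locations
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Once the clusters are found, their distinct features can be analyzed by summarizing the non-spatial attributes of the members of each cluster.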
  21. 21. Density-Based Cluster Analysis: OPTICS & Its Applications
  22. 22. Clustering and Distribution Density Functions: Density Attractor
  23. 23. Center-Defined and Arbitrary Shaped
  24. 24. STING: A Statistical Information Grid Approach <ul><li>Wang, Yang and Muntz (VLDB’97) </li></ul><ul><li>Each cell stores statistical information about the measure for the data it covers; higher-level cells summarize the cells below them </li></ul><ul><li>Multi-level resolution </li></ul>
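The multi-level grid idea can be illustrated with a simplified sketch. It is not the STING algorithm itself: it keeps only count, mean, min, and max per cell, assumes each parent covers a 2x2 block of child cells, and uses invented names (`build_grid`, `coarsen`, `merge_stats`). The point it shows is that a higher-level cell is computed from its children's statistics without rescanning the data.

```python
def cell_stats(values):
    """Summary statistics of the measure for one bottom-level cell."""
    return {"n": len(values), "mean": sum(values) / len(values),
            "min": min(values), "max": max(values)}

def merge_stats(children):
    """Statistics of a parent cell, derived only from its children's statistics."""
    n = sum(c["n"] for c in children)
    return {"n": n,
            "mean": sum(c["n"] * c["mean"] for c in children) / n,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

def build_grid(points, cell_size):
    """Bottom level: group (x, y, value) records by grid cell."""
    cells = {}
    for x, y, v in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append(v)
    return {key: cell_stats(vals) for key, vals in cells.items()}

def coarsen(level):
    """Next level up: each parent covers a 2x2 block of child cells."""
    groups = {}
    for (i, j), stats in level.items():
        groups.setdefault((i // 2, j // 2), []).append(stats)
    return {key: merge_stats(kids) for key, kids in groups.items()}

# Hypothetical records: (x, y, measure)
points = [(0.5, 0.5, 10), (0.6, 0.4, 20), (1.5, 0.5, 30), (3.5, 3.5, 40)]
bottom = build_grid(points, cell_size=1)
top = coarsen(bottom)
print(bottom)
print(top)
```

A query can then start at the coarse level and descend only into cells whose statistics make them relevant.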
  25. 25. WaveCluster <ul><li>G. Sheikholeslami, et al. (1998) </li></ul><ul><li>Cluster analysis based on multi-resolution wavelet transformation </li></ul>
  26. 26. Constraint-Based Clustering <ul><li>Constraints on individual objects </li></ul><ul><ul><li>Simple selection of such objects before clustering </li></ul></ul><ul><li>Clustering parameters as constraints </li></ul><ul><ul><li>K-means, density-based: radius, min-# of points </li></ul></ul><ul><li>Constraints imposed by physical obstacles </li></ul><ul><ul><li>Clustering with Obstructed Distance </li></ul></ul><ul><li>Constraints specified on clusters using SQL aggregates </li></ul><ul><ul><li>Sum of the profits in each cluster > $1 million </li></ul></ul><ul><ul><li>Average sales in each cluster > $20 million </li></ul></ul><ul><ul><li>Min # of golden customers (in each cluster) > 1000 </li></ul></ul>
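The aggregate constraints in the last group can be expressed as a predicate over a finished cluster. The sketch below only checks clusters after the fact, which is the naive approach; a constraint-based clustering algorithm would enforce the constraints while forming the clusters. The record fields (`profit`, `golden`) and the function name `satisfies` are assumptions made for this illustration.

```python
def satisfies(cluster, min_total_profit, min_golden_customers):
    """Check SUM(profit) and COUNT(golden customers) constraints on one cluster."""
    total_profit = sum(c["profit"] for c in cluster)
    golden = sum(1 for c in cluster if c["golden"])
    return total_profit > min_total_profit and golden > min_golden_customers

# Hypothetical clusters of customer records
cluster_a = [{"profit": 700_000, "golden": True},
             {"profit": 400_000, "golden": True}]
cluster_b = [{"profit": 900_000, "golden": False},
             {"profit": 50_000, "golden": True}]

for name, cluster in [("A", cluster_a), ("B", cluster_b)]:
    ok = satisfies(cluster, min_total_profit=1_000_000, min_golden_customers=1)
    print(name, "satisfies constraints" if ok else "violates constraints")
```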
  27. 27. Constraint-Based Clustering: Planning ATM Locations [Figure: two panels, (1) spatial data with obstacles (mountain, river, bridge) and (2) clusters C1 to C4 obtained without taking obstacles into consideration]
  28. 28. Clustering with Spatial Obstacles [Figure: two panels comparing clustering results, one taking obstacles into account and one not taking obstacles into account]
  29. 29. Towards Spatial Data Mining System: An Architecture <ul><li>Graphic User Interface </li></ul><ul><li>Mining modules: Geo-Classifier, Geo-OLAP Analyzer, Geo-Predictor, Geo-Clustor, Geo-Associator, Future Modules </li></ul><ul><li>Spatial Database and Warehouse Server </li></ul><ul><li>Data: Spatial DB, Non-Spatial DB, meta data: hierarchy </li></ul>
  30. 30. Outline <ul><li>Why geo-spatial data mining? </li></ul><ul><li>Spatial data mining : major progress </li></ul><ul><ul><li>Spatial OLAP </li></ul></ul><ul><ul><li>Spatial association </li></ul></ul><ul><ul><li>Spatial classification </li></ul></ul><ul><ul><li>Spatial clustering and outlier analysis </li></ul></ul><ul><li>Research challenges in spatial data mining </li></ul>
  31. 31. Research Challenges in Spatial Data Mining <ul><li>Mining spatio-temporal data </li></ul><ul><li>Mining spatially related stream data </li></ul><ul><li>Spatial data mining applications (land use, bio-medical) </li></ul>
  32. 32. Conclusions <ul><li>Spatial data mining vs. traditional spatial analysis </li></ul><ul><ul><li>Scalability, architecture, functions, methods </li></ul></ul><ul><li>Good progress has been made on spatial data mining </li></ul><ul><ul><li>OLAP, association, clustering, classification, outlier analysis, etc. </li></ul></ul><ul><li>Still lots to be done! Young and promising direction </li></ul><ul><li>Joint efforts (from multiple disciplines) lead to joyous promises! </li></ul>
  33. 33. http://www.cs.uiuc.edu/~hanj <ul><li>Thank you !!! </li></ul>
  34. 34. Some References on Spatial Data Mining <ul><li>H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001. </li></ul><ul><li>M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander, "Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support", Data Mining and Knowledge Discovery, an International Journal, 4, 2000, pp. 193-216. </li></ul><ul><li>J. Han, M. Kamber, and A. K. H. Tung, "Spatial Clustering Methods in Data Mining: A Survey", in H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001. </li></ul><ul><li>Y. Bedard, T. Merrett, and J. Han, "Fundamentals of Geospatial Data Warehousing for Geographic Knowledge Discovery", in H. Miller and J. Han (eds.), Geographic Data Mining and Knowledge Discovery, Taylor and Francis, 2001. </li></ul>
