Data Mining - Steve Tanner, UAH


    1. Data Mining Research and Applications
       Workshop on Cyberinfrastructure for Environmental Research and Education, October 31, 2002
       Steve Tanner, Information Technology and Systems Center, University of Alabama in Huntsville
       [email_address] 256.824.5143 www.itsc.uah.edu
    2. Key Questions:
       • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
       • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
       • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
       • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
    3. Data Mining
       • Data mining is an interdisciplinary field drawing on statistics, machine learning, pattern recognition, and other areas
       • Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
       • Derived knowledge for decision making, predictions, and disaster response
       • ADaM – Algorithm Development and Mining System
       • datamining.itsc.uah.edu
    4. Techniques used for Data Mining
       • Clustering Techniques
         • K Means
         • Isodata
         • Maximum
       • Pattern Recognition
         • Bayes Classifier
         • Minimum Distance Classifier
       • Image Analysis
         • Boundary Detection
         • Cooccurrence Matrix
         • Dilation and Erosion
         • Histogram Operations
         • Polygon Circumscript
         • Spatial Filtering
         • Texture Operations
       • Genetic Algorithms
       • Neural Networks
       • Etc.
       Data mining systems usually involve a toolbox of many different techniques and a means for combining them.
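As one concrete instance of the clustering techniques listed on this slide, the sketch below runs plain K Means on a handful of 2-D points. It is a minimal illustration and not ADaM's implementation; the function name `k_means`, the seeding rule (the first k points become the initial centroids), and the sample points are all assumptions made for the example.

```python
import math

def k_means(points, k, iterations=20):
    """Plain K Means on a list of equal-length tuples.

    ASSUMPTION for this sketch: the first k points seed the centroids,
    which keeps the result deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Made-up sample data: two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = k_means(points, k=2)
```

In a toolbox setting, a routine like this would be one interchangeable step whose labels feed a later classification or visualization step.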
    5. Typical Everyday Encounters with Data Mining
       • Google
         • Complex algorithm sequence to decide the order of results
       • Amazon.com
         • Additional purchase suggestions
       • Credit Card Fraud
         • Event notification of odd usage
       Most current data mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.
    6. User Perspective and Data Perspective of the Data Mining Process
       [Diagram labels: Data Stores; Information; Analysis; Knowledge; Decision; Dataset Volume; Value; Calibration & Navigation; Preprocessing; Transformation; Dataset Specific Algorithms; Domain Specific Algorithms; User Perspective; Data Perspective; Data]
    7. Scientific Analysis
       • Harnesses human analysis capabilities
         • Highly creative
       • Based on theory and hypothesis formulation
         • Physical basis is normally used for algorithms
       • Draws insights about the underlying phenomena
       • Rapidly widening gap between data collection capabilities and the ability to analyze data
       • Potential for vast amounts of data to go unused
       Data Mining
       • Provides automation of the analysis process
       • Can be used for dimensionality reduction when manual examination of data is impossible
       • Can have limitations
         • May not utilize domain knowledge
         • May be difficult to prove validity of the results
       • There may not be a physical basis
       • Should be viewed as a complementary tool and not a replacement for scientific analysis
    8. Similarity between Data Mining and Scientific Analysis Process
    9. Mining Environments
       • Mining Framework (ADaM)
         • Complete System (Client and Engine)
         • Mining Engine (User provides its own client)
         • Application Specific Mining Systems
         • Operations Tool Kit
         • Stand Alone Mining Algorithms
         • Data Fusion
       • Distributed/Federated Mining
         • Distributed services
         • Distributed data
         • Chaining using Interchange Technologies
       • On-board Mining (EVE)
         • Real time and distributed mining
         • Processing environment constraints
    10. Using the Mining Framework: Focusing on the information in data
    11. The ADaM Processing Model
        Data flow: Raw Data, Translated Data, Preprocessed Data, Patterns/Models, Results
        • Input
          • PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII, Grass Vectors (ASCII Text), HDF, HDF-EOS, GIF, Intergraph Raster, Others...
        • Processing: Preprocessing
          • Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
          • Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
          • Image Processing: Cropping, Inversion, Thresholding
          • Others...
        • Processing: Analysis
          • Clustering: K Means, Isodata, Maximum
          • Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
          • Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
          • Genetic Algorithms
          • Neural Networks
          • Others...
        • Output
          • GIF Images, HDF Raster Images, HDF Scientific Data Sets, HDF-EOS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, GeoTIFF, Others...
    12. Iterative Nature of the Data Mining Process
        [Diagram labels: Data; Preprocessing; Cleaning and Integration; Selection and Transformation; Mining; Discovery; Evaluation and Presentation; Knowledge]
    13. Distributed/Federated Mining: Meshing data and algorithms to generate knowledge
    14. ADaM: Mining Environment for Scientific Data
        • The system provides knowledge discovery, feature detection, and content-based searching for data values as well as for metadata.
          • Contains over 120 different operations
          • Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms, and others
    15. Classification Based on Texture Features and Edge Density
        • Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
        • Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
        • Comparison based on
          • Accuracy of detection
          • Amount of time required to classify
    16. Parallel Version of Cloud Extraction
        [Diagram labels: GOES Image; Master; Slave 1; Slave 2; Slave 3; Laplacian Filter; Sobel Horizontal Filter; Sobel Vertical Filter; Energy Computation (four instances); Classifier; Cloud Image. Example panels: GOES Image; Cumulus Cloud Mask]
        • GOES images can be used to recognize cumulus cloud fields
        • Cumulus clouds are small and do not show up well in 4 km resolution IR channels
        • Detection of cumulus cloud fields in GOES imagery can be accomplished by using texture features or edge detectors
        • Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster
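The filter, energy, and classifier stages named on this slide can be sketched as follows. This is a rough illustration under stated assumptions, not the ADaM implementation: the kernels are the textbook 3x3 Laplacian and Sobel kernels, "energy" is taken to be the squared filter response, the classifier is a simple threshold on the summed energy, and a thread pool stands in for the master/slave cluster. The names `filter_energy` and `cumulus_mask` and the toy image are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Textbook 3x3 kernels (assumed; the slide does not give coefficients).
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

def filter_energy(image, kernel):
    """Apply a 3x3 kernel and return the squared response per interior pixel."""
    rows, cols = len(image), len(image[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3) for j in range(3)
            )
            energy[r][c] = float(response * response)
    return energy

def cumulus_mask(image, threshold):
    """Run the three filters concurrently, then classify each pixel."""
    kernels = [LAPLACIAN, SOBEL_HORIZONTAL, SOBEL_VERTICAL]
    # "Master" role: hand one filter to each worker.
    with ThreadPoolExecutor(max_workers=3) as pool:
        energies = list(pool.map(lambda k: filter_energy(image, k), kernels))
    # Stand-in classifier (ASSUMPTION): flag pixels whose summed energy is high.
    return [
        [1 if sum(e[r][c] for e in energies) > threshold else 0
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]

# Toy 5x5 "visible image": flat background with one bright pixel.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 50
mask = cumulus_mask(image, threshold=10000)
```

Because each filter reads the same input and writes its own output, the three branches are independent, which is what makes the split across workers straightforward.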
    17. Automated Data Analysis for Boundary Detection and Quantification
        • Analysis of polar cap auroras in large volumes of spacecraft UV images
        • Science Rationale: Indicators to predict geomagnetic storms, which can
          • Damage satellites
          • Disrupt radio connections
        • Developing different mining algorithms to detect and quantify the polar cap boundary
        [Figure: Polar Cap Boundary]
    18. Detecting Signatures
        • Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
        • Developing an algorithm based on wind velocity shear signatures
          • Improve accuracy and reduce false alarm rates
    19. Genetic Subtyping Using Hierarchical Clustering
        • Biologists are interested in comparing DNA sequences to see how closely related they are to one another
        • Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
        • Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
        • This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
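The tree-building idea on this slide can be sketched with a small agglomerative clustering routine. This is a simplified illustration, assuming pre-aligned sequences of equal length, the p-distance (fraction of differing positions) as the genetic distance, and average linkage between clusters; real phylogenetic work typically uses more refined distance models. The function names and the four short sequences are made up for the example.

```python
def genetic_distance(a, b):
    """p-distance: fraction of aligned positions where two sequences differ."""
    assert len(a) == len(b), "sequences must be aligned to equal length"
    return sum(x != y for x, y in zip(a, b)) / len(a)

def hierarchical_clustering(sequences):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    # Each cluster is a pair: (subtree, list of member names).
    clusters = [(name, [name]) for name in sequences]

    def linkage(m1, m2):
        pairs = [(x, y) for x in m1 for y in m2]
        total = sum(genetic_distance(sequences[x], sequences[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them into one subtree.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical aligned sequences: A/B and C/D differ by one base each.
sequences = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TTGTACCA",
    "D": "TTGTACCT",
}
tree = hierarchical_clustering(sequences)
```

The nested tuples mirror the tree: sequences grouped at the innermost level are the closest relatives under the chosen distance.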
    20. Mining on Data Ingest: Tropical Cyclone Detection
        [Diagram labels: Advanced Microwave Sounding Unit (AMSU-A) Data; Calibration / Limb Correction / Converted to Tb; Mining Environment; Data Archive; Result; Further Analysis; Knowledge Base. Example shown: Hurricane Floyd]
        Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis.
        Mining Plan:
        • Water cover mask to eliminate land
        • Laplacian filter to compute temperature gradients
        • Science algorithm to estimate wind speed
        • Contiguous regions with wind speeds above a desired threshold identified
        • Additional test to eliminate false positives
        • Maximum wind speed and location produced
        pm-esip.msfc.nasa.gov/
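The mining plan above maps naturally onto a short pipeline. The sketch below follows the six steps on a toy grid, with two loud assumptions: the science algorithm that converts temperature gradients to wind speed is not given on the slide, so `estimate_wind_speed` is a placeholder linear rule, and the false-positive test is replaced by a minimum region size. All names and numbers are hypothetical.

```python
from collections import deque

def laplacian(grid, r, c):
    """4-neighbour Laplacian of brightness temperature at an interior cell."""
    return (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            - 4 * grid[r][c])

def estimate_wind_speed(gradient):
    # PLACEHOLDER for the science algorithm, which the slide does not specify.
    # ASSUMPTION: wind speed grows linearly with the gradient magnitude.
    return 2.0 * abs(gradient)

def detect_cyclones(tb, is_water, wind_threshold, min_cells=2):
    rows, cols = len(tb), len(tb[0])
    wind = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if is_water[r][c]:  # step 1: water mask eliminates land
                wind[r][c] = estimate_wind_speed(laplacian(tb, r, c))  # steps 2-3
    seen = [[False] * cols for _ in range(rows)]
    events = []
    for r in range(rows):
        for c in range(cols):
            if wind[r][c] <= wind_threshold or seen[r][c]:
                continue
            # Step 4: flood-fill one contiguous region above the threshold.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols and not seen[ny][nx]
                            and wind[ny][nx] > wind_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Step 5: stand-in false-positive test (ASSUMPTION): drop tiny regions.
            if len(region) >= min_cells:
                # Step 6: report the maximum wind speed and its location.
                peak = max(region, key=lambda cell: wind[cell[0]][cell[1]])
                events.append({"max_wind": wind[peak[0]][peak[1]], "location": peak})
    return events

# Toy 5x5 brightness temperature grid with one anomaly and one land cell.
tb = [[250] * 5 for _ in range(5)]
tb[2][2] = 230
is_water = [[True] * 5 for _ in range(5)]
is_water[1][2] = False
events = detect_cyclones(tb, is_water, wind_threshold=30)
```

Running the plan at ingest time means only the small `events` list, rather than the full grid, needs to be forwarded to downstream users.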
    21. Multiple Mining Environments: Passive Microwave ESIP Information System
        [Diagram labels: Data Ingest & Processing (TMI; AMSU-A; SSM/I; SSM/T2; AMSU-A Ingest; TMI Ingest and Product Generation; AMSU Product Generation; ADaM-based Processing); Distributed Data Stores; PM-ESIP Catalog; Order Staging; Custom Processing (Input; Process; Output; Subset/Grid/Format; ADaM Servers); Web Interfaces & Applications (AMSU-A Images; Temperature Trends; Cyclone Winds; STT Application; Visualization & Exploration; Data Ordering; FTP)]
    22. Interoperability: Accessing Heterogeneous Data
        • Science data comes in:
          • Different formats, types, and structures
          • Different states of processing (raw, calibrated, derived, modeled, or interpreted)
          • Enormous volumes
        • Heterogeneity leads to data usability problems
        • One approach: Standard data formats
          • Difficult to implement and enforce
          • Can’t anticipate all needs
            • Some data can’t be modeled or is lost in translation
          • The cost of converting legacy data
        • A better approach: Interchange Technologies
          • Earth Science Markup Language
        [Diagram, The Problem: Data Format 1, Data Format 2, Data Format 3; Reader 1, Reader 2, Format Converter; Application]
        [Diagram, The Solution: Data Format 1, Data Format 2, Data Format 3, each with an ESML File; ESML Library; Application]
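The contrast between per-format readers and a single library driven by external descriptions can be illustrated with a toy example. The sketch below is only an analogy: the `description` dictionaries are hypothetical stand-ins for ESML files (the actual ESML schema is not shown in these slides and is not reproduced here), and the two binary layouts are invented.

```python
import struct

def read_records(raw, description):
    """Decode fixed-size binary records using an external format description.

    ASSUMPTION: `description` is a hypothetical stand-in for an ESML file,
    used here only to show the idea of description-driven reading.
    """
    layout = description["byte_order"] + "".join(
        code for _, code in description["fields"]
    )
    names = [name for name, _ in description["fields"]]
    return [dict(zip(names, values)) for values in struct.iter_unpack(layout, raw)]

# Two made-up formats holding the same kind of measurement in different layouts.
format_1 = {"byte_order": ">", "fields": [("lat", "f"), ("lon", "f"), ("temp_k", "f")]}
format_2 = {"byte_order": "<", "fields": [("temp_k", "d"), ("lat", "h"), ("lon", "h")]}

file_1 = struct.pack(">fff", 34.5, -86.5, 290.0)
file_2 = struct.pack("<dhh", 290.0, 34, -86)

# The application calls one reader; only the description changes per format.
records = read_records(file_1, format_1) + read_records(file_2, format_2)
```

Supporting a new format in this style means writing a new description rather than a new reader or converter, and the legacy files themselves are left untouched.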
    23. Chained Image Processing Services
        Service chaining is used to integrate modules (or services) developed on distributed platforms and in different languages into a single processing solution.
        [Diagram labels: Data Files; Data Streams; Knowledge Base; ESML; Chained Services: ESML Lib Reader (Java/C+ Windows), GeoCrop (Perl/Linux), Resample (Perl/C – Linux), Format (Perl/Linux), Draw Image (PERL/C – Linux), WMS (Java/Windows)]
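Service chaining amounts to feeding each service's output into the next. The sketch below shows that composition pattern with local functions standing in for the remote services on the slide; in the system described, the services run on different platforms and in different languages, which this example does not attempt to model. The function bodies, parameters, and grid are hypothetical.

```python
from functools import partial, reduce

def chain(*services):
    """Compose services so that each one's output feeds the next."""
    return lambda data: reduce(lambda value, service: service(value), services, data)

# Hypothetical local stand-ins for GeoCrop, Resample, and Format.
def geo_crop(grid, top, left, height, width):
    return [row[left:left + width] for row in grid[top:top + height]]

def resample(grid, step):
    return [row[::step] for row in grid[::step]]

def format_rows(grid):
    return [" ".join(str(v) for v in row) for row in grid]

pipeline = chain(
    partial(geo_crop, top=0, left=1, height=4, width=3),
    partial(resample, step=2),
    format_rows,
)

grid = [[r * 4 + c for c in range(4)] for r in range(4)]
result = pipeline(grid)
```

Because each stage only depends on the shape of its input, stages can be reordered or swapped without touching the others, which is the property a chain of independently developed services relies on.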
    24. Data Integration using Web Mapping Services
        [Diagram labels: Globe (Coastlines, Countries); AMSU-A (AMSU-A Channel 01); Knowledge Base (MCS Events, Cyclone Events); ITSC]
        AMSU-A data overlaid with MCS and cyclone events for September 2000, merged with world boundaries from Globe.
    25. Fused Displays from Multiple Servers
        Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
    26. Concept Hierarchy for Data Mining and Fusion
        [Diagram labels: Model and Observation Data; Data File Level; Feature I, Feature II, Feature III; Feature Set I; Feature X, Feature Y; Event A, Event B; Conceptual Level; Concept Mining; Multi-Level Mining; Decision Support]
    27. EVE – Environment for On-board Processing
        On-Board Real-Time Processing
        • Sensor Control/Targeting
        • Anomaly detection
        • Data Mining
        • Autonomous Decision Making
        • Immediate response
        • Direct satellite to Earth delivery of results
        www.itsc.uah.edu/eve
    28. A Reconfigurable Web of Interacting Sensors
        [Diagram labels: Ground Network (three instances); Military; Weather; Satellite Constellations; Communications]
    29. Example Plan: Threshold events in AMSU-A Streaming Data (EVE)
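A threshold-event plan over streaming data can be sketched as a generator that watches samples as they arrive and emits one event per run of values above the threshold. This is an illustrative guess at what such a plan might do, not EVE's actual plan format; the function name, the event tuple layout, and the sample values are assumptions.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak) for each run of samples above threshold."""
    start = peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value  # a new event begins
            else:
                peak = max(peak, value)     # event continues; track its peak
        elif start is not None:
            yield (start, index - 1, peak)  # event ended on the previous sample
            start = peak = None
    if start is not None:
        yield (start, index, peak)          # stream ended mid-event

# Made-up sample stream; values stand in for a per-sample measurement.
samples = [1, 5, 7, 2, 9, 9, 1, 8]
events = list(threshold_events(samples, threshold=4))
```

Since the generator holds only the current event's state, it suits a constrained on-board environment where the full data stream cannot be buffered.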
    30. Data Integration and Mining: From Global Information to Local Knowledge
        [Application areas shown: Precision Agriculture; Emergency Response; Weather Prediction; Urban Environments]
    31. Key Questions:
        • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
        • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
        • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
        • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
