Data Mining - Steve Tanner, UAH
  • Transcript

    • 1. Data Mining Research and Applications
      Workshop on Cyberinfrastructure For Environmental Research and Education, October 31, 2002
      Steve Tanner, Information Technology and Systems Center, University of Alabama in Huntsville
      [email_address] 256.824.5143 www.itsc.uah.edu
    • 2. Key Questions:
      • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
      • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
      • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
      • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
    • 3. Data Mining
      • Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others
      • Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
      • Derived knowledge for decision making, predictions and disaster response
      • ADaM – Algorithm Development and Mining System
      • datamining.itsc.uah.edu
    • 4. Techniques used for Data Mining
      Data Mining systems usually involve a toolbox of many different techniques and a means for combining them
      • Clustering Techniques
        • K Means
        • Isodata
        • Maximum
      • Pattern Recognition
        • Bayes Classifier
        • Minimum Distance Classifier
      • Image Analysis
        • Boundary Detection
        • Cooccurrence Matrix
        • Dilation and Erosion
        • Histogram Operations
        • Polygon Circumscript
        • Spatial Filtering
        • Texture Operations
      • Genetic Algorithms
      • Neural Networks
      • Etc.
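As an illustration of the first clustering technique listed on this slide, here is a minimal K Means sketch (Lloyd's algorithm) in plain Python. It is not ADaM code; the function name `kmeans`, the 2-D point representation and the fixed iteration count are assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm.

    Returns (centroids, labels). Illustrative only: the initial centroids
    are k randomly sampled input points and the loop runs a fixed number
    of iterations instead of testing for convergence.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels
```

Calling `kmeans(points, 2)` on two well-separated groups of points returns one label per point, with each group sharing a label.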
    • 5. Typical Everyday Encounters with Data Mining
      • Google
        • Complex algorithm sequence to decide order
      • Amazon.Com
        • Additional purchase suggestions
      • Credit Card Fraud
        • Event notification of odd usage
      Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.
    • 6. User Perspective and Data Perspective of the Data Mining Process
      [Diagram labels: Data Stores; Information; Analysis; Knowledge; Decision; Dataset Volume; Value; Calibration & Navigation; Preprocessing; Transformation; Dataset Specific Algorithms; Domain Specific Algorithms; User Perspective; Data Perspective; Data]
    • 7. Scientific Analysis
      • Harnesses human analysis capabilities
        • Highly creative
      • Based on theory and hypothesis formulation
        • Physical basis is normally used for algorithms
      • Drawing insights about the underlying phenomena
      • Rapidly widening gap between data collection capabilities and the ability to analyze data
      • Potential of vast amounts of data to be unused
      Data Mining
      • Provides automation of the analysis process
      • Can be used for dimensionality reduction when manual examination of data is impossible
      • Can have limitations
        • May not utilize domain knowledge
        • May be difficult to prove validity of the results
      • There may not be a physical basis
      • Should be viewed as a complementary tool and not a replacement for scientific analysis
    • 8. Similarity between Data Mining and Scientific Analysis Process
    • 9. Mining Environments
      • Mining Framework (ADaM)
        • Complete System (Client and Engine)
        • Mining Engine (User provides its own client)
        • Application Specific Mining Systems
        • Operations Tool Kit
        • Stand Alone Mining Algorithms
        • Data Fusion
      • Distributed/Federated Mining
        • Distributed services
        • Distributed data
        • Chaining using Interchange Technologies
      • On-board Mining (EVE)
        • Real time and distributed mining
        • Processing environment constraints
    • 10. Using the Mining Framework: Focusing on the information in data
    • 11. The ADaM Processing Model
      Data flow: Raw Data, Translated Data, Preprocessed Data, Patterns/Models, Results
      • Input: PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII Grass, Vectors (ASCII Text), HDF, HDF-EOS, GIF, Intergraph Raster, Others...
      • Processing
        • Preprocessing
          • Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
          • Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
          • Image Processing: Cropping, Inversion, Thresholding
          • Others...
        • Analysis
          • Clustering: K Means, Isodata, Maximum
          • Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
          • Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
          • Genetic Algorithms
          • Neural Networks
          • Others…
      • Output: GIF Images, HDF Raster Images, HDF Scientific Data Sets, HDF-EOS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, GeoTIFF, Others...
    • 12. Iterative Nature of the Data Mining Process
      [Diagram labels: Data; Preprocessing; Cleaning and Integration; Mining; Selection and Transformation; Discovery; Knowledge; Evaluation and Presentation]
    • 13. Distributed/Federated Mining: Meshing data and algorithms to generate knowledge
    • 14. ADaM : Mining Environment for Scientific Data
      • The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
        • Contains over 120 different operations
        • Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others
    • 15. Classification Based on Texture Features and Edge Density
      • Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
      • Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
      • Comparison based on
        • Accuracy of detection
        • Amount of time required to classify
    • 16. Parallel Version of Cloud Extraction
      • GOES images can be used to recognize cumulus cloud fields
      • Cumulus clouds are small and do not show up well in 4 km resolution IR channels
      • Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
      • Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster
      [Diagram labels: GOES Image; Master; Slave 1; Slave 2; Slave 3; Laplacian Filter; Sobel Horizontal Filter; Sobel Vertical Filter; Energy Computation (x4); Classifier; Cloud Image]
      [Images: GOES Image; Cumulus Cloud Mask]
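The filter-then-energy structure on this slide can be sketched as follows. This is a hedged illustration, not the ADaM implementation: the kernels are the textbook 3x3 Laplacian and Sobel masks, "energy" is assumed to be the mean squared filter response, a thread pool stands in for the master/slave cluster, and `is_cumulus_like` with its threshold is a hypothetical placeholder for the classifier.

```python
from concurrent.futures import ThreadPoolExecutor

# Textbook 3x3 masks; the exact kernels used in the talk are not given.
KERNELS = {
    "laplacian": [[0, 1, 0], [1, -4, 1], [0, 1, 0]],
    "sobel_h": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
    "sobel_v": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
}


def convolve3x3(image, kernel):
    """Apply a 3x3 mask to every interior pixel; border pixels stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3)
                for j in range(3)
            )
    return out


def energy(filtered):
    """Mean squared response over the whole image (assumed definition)."""
    count = len(filtered) * len(filtered[0])
    return sum(v * v for row in filtered for v in row) / count


def edge_energies(image):
    """Run the three filters concurrently, one worker per filter."""
    names = list(KERNELS)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        values = pool.map(lambda n: energy(convolve3x3(image, KERNELS[n])), names)
        return dict(zip(names, values))


def is_cumulus_like(image, threshold):
    """Hypothetical classifier: total edge energy above a chosen threshold."""
    return sum(edge_energies(image).values()) > threshold
```

Because the three filters are independent of one another, each can run on its own worker and only the scalar energies need to be sent back to the classifier, which is what makes the scheme a natural fit for a cluster.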
    • 17. Automated Data Analysis for Boundary Detection and Quantification
      • Analysis of polar cap auroras in large volumes of spacecraft UV images
      • Science Rationale: Indicators to predict geomagnetic storms, which can:
        • Damage satellites
        • Disrupt radio connections
      • Developing different mining algorithms to detect and quantify polar cap boundary
      Polar Cap Boundary
    • 18. Detecting Signatures
      • Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
      • Developing an algorithm based on wind velocity shear signatures
        • Improve accuracy and reduce false alarm rates
    • 19. Genetic Subtyping Using Hierarchical Clustering
      • Biologists are interested in comparing DNA sequences to see how closely related they are to one another
      • Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
      • Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
      • This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
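A minimal sketch of the tree-building idea on this slide, in plain Python. Two assumptions are made that the slide does not state: the genetic distance is taken to be the p-distance (fraction of differing sites between aligned, equal-length sequences), and clusters are merged with average linkage. The function names are illustrative.

```python
def p_distance(a, b):
    """Fraction of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf names."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def hierarchical_cluster(seqs):
    """Agglomerative clustering of named sequences (average linkage).

    seqs maps a name to an aligned sequence. The result is a binary tree
    of nested tuples whose leaves are the sequence names.
    """
    clusters = list(seqs)

    def linkage(t1, t2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
        return sum(p_distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0]
```

For four short sequences in which A and B differ at one site and C and D differ at one site, the resulting tree pairs A with B and C with D before joining the two pairs at the root.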
    • 20. Mining on Data Ingest: Tropical Cyclone Detection
      [Diagram labels: Advanced Microwave Sounding Unit (AMSU-A) Data; Calibration / Limb Correction / Converted to Tb; Mining Environment; Data Archive; Result; Further Analysis; Knowledge Base; example image: Hurricane Floyd]
      Results are placed on the web, made available to National Hurricane Center & Joint Typhoon Warning Center, and stored for further analysis
      • Mining Plan:
        • Water cover mask to eliminate land
        • Laplacian filter to compute temperature gradients
        • Science Algorithm to estimate wind speed
        • Contiguous regions with wind speeds above a desired threshold identified
        • Additional test to eliminate false positives
        • Maximum wind speed and location produced
      pm-esip.msfc.nasa.gov/
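The mining plan on this slide can be sketched end to end in plain Python. This is a hedged illustration under loud assumptions: `estimate_wind` is a placeholder that simply scales the Laplacian magnitude and is NOT the science algorithm referred to in the talk, and the false-positive test is assumed to be a minimum region size. All function and parameter names are hypothetical.

```python
def laplacian(grid, r, c):
    """4-neighbour Laplacian of a brightness temperature grid at (r, c)."""
    return (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            - 4 * grid[r][c])


def estimate_wind(gradient, scale=1.0):
    """Placeholder for the science algorithm (assumption, not the real one)."""
    return scale * abs(gradient)


def detect_cyclones(tb, is_water, wind_threshold, min_pixels=2):
    """Return one dict per detected region: max wind, its location, size."""
    rows, cols = len(tb), len(tb[0])
    wind = [[0.0] * cols for _ in range(rows)]
    # Steps 1-3: mask out land, filter, estimate wind speed.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if is_water[r][c]:
                wind[r][c] = estimate_wind(laplacian(tb, r, c))
    # Step 4: flood-fill contiguous regions above the threshold.
    seen, detections = set(), []
    for r in range(rows):
        for c in range(cols):
            if wind[r][c] <= wind_threshold or (r, c) in seen:
                continue
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and wind[ny][nx] > wind_threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            # Step 5: assumed false-positive test, drop tiny regions.
            if len(region) < min_pixels:
                continue
            # Step 6: report the maximum wind speed and where it occurs.
            peak = max(region, key=lambda p: wind[p[0]][p[1]])
            detections.append({
                "max_wind": wind[peak[0]][peak[1]],
                "location": peak,
                "pixels": len(region),
            })
    return detections
```

Running the function on each granule as it is ingested, and storing only the returned detections, mirrors the "mining on data ingest" arrangement described on the slide.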
    • 21. Multiple Mining Environments: Passive Microwave ESIP Information System
      [Diagram labels: AMSU Product Generation; TMI; AMSU-A; SSM/I; SSM/T2; Order Staging; PM-ESIP Catalog; AMSU-A Ingest; ADaM-based Processing; Distributed Data Stores; TMI Ingest and Product Generation; Data Ingest & Processing; Custom Processing; Input; Process; Output; Subset/Grid/Format; ADaM Servers; Web Interfaces & Applications; AMSU-A Images; Temperature Trends; STT Application; Visualization & Exploration; FTP; Cyclone Winds; Data Ordering]
    • 22. Interoperability: Accessing Heterogeneous Data
      • Science data comes in:
        • Different formats, types and structures
        • Different states of processing (raw, calibrated, derived, modeled or interpreted)
        • Enormous volumes
      • Heterogeneity leads to data usability problems
      • One approach: Standard data formats
        • Difficult to implement and enforce
        • Can’t anticipate all needs
          • Some data can’t be modeled or is lost in translation
        • The cost of converting legacy data
      • A better approach: Interchange Technologies
        • Earth Science Markup Language
      [Diagram, The Problem: DATA FORMAT 1, DATA FORMAT 2 and DATA FORMAT 3 reach the APPLICATION through READER 1, READER 2 and a FORMAT CONVERTER]
      [Diagram, The Solution: DATA FORMAT 1, DATA FORMAT 2 and DATA FORMAT 3 are each described by an ESML FILE and reach the APPLICATION through the ESML LIBRARY]
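The interchange idea on this slide, one reading library driven by an external description of each format, can be sketched in Python. The descriptions below are plain dictionaries invented for this example; they are NOT ESML syntax, and `read_records` is a hypothetical stand-in for the library, built on the standard `struct` module.

```python
import struct


def read_records(raw, description):
    """Decode fixed-size binary records using an external format description.

    description is a dict with a struct byte-order prefix and an ordered
    list of fields, each with a name and a struct type code.
    """
    fmt = description["byte_order"] + "".join(f["type"] for f in description["fields"])
    names = [f["name"] for f in description["fields"]]
    size = struct.calcsize(fmt)
    return [
        dict(zip(names, struct.unpack_from(fmt, raw, offset)))
        for offset in range(0, len(raw) - size + 1, size)
    ]


# Two hypothetical formats holding the same information in different layouts.
FORMAT_1 = {
    "byte_order": ">",  # big-endian
    "fields": [
        {"name": "lat", "type": "f"},
        {"name": "lon", "type": "f"},
        {"name": "tb", "type": "h"},
    ],
}
FORMAT_2 = {
    "byte_order": "<",  # little-endian, fields in a different order
    "fields": [
        {"name": "tb", "type": "h"},
        {"name": "lat", "type": "f"},
        {"name": "lon", "type": "f"},
    ],
}
```

The application calls `read_records(raw, FORMAT_1)` or `read_records(raw, FORMAT_2)` and receives the same dictionaries either way, so supporting a new format means writing a new description rather than a new reader or converter.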
    • 23. Chained Image Processing Services
      Service Chaining is used to integrate modules – or services – developed on distributed platforms and in different languages for a single processing solution.
      [Diagram labels: Data; Data Files; ESML; WMS (Java/Windows); Draw Image (PERL/C – Linux); Data Files; Knowledge Base; GeoCrop (Perl/Linux); Resample (Perl/C – Linux); Format (Perl/Linux); Data Streams; Chained Services; ESML Lib Reader (Java/C+ Windows)]
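A minimal sketch of service chaining, with local Python functions standing in for the remote services named in the diagram. The function names echo the GeoCrop, Resample and Format boxes, but their behavior here (array slicing, striding, text formatting) is assumed for illustration; in the system described, each step runs on a different platform and exchanges data streams.

```python
from functools import reduce


def chain(*services):
    """Compose services so that each one's output feeds the next."""
    return lambda data: reduce(lambda acc, service: service(acc), services, data)


def geo_crop(r0, r1, c0, c1):
    """Stand-in for a crop service: keep rows r0..r1-1 and columns c0..c1-1."""
    return lambda grid: [row[c0:c1] for row in grid[r0:r1]]


def resample(step):
    """Stand-in for a resample service: keep every step-th row and column."""
    return lambda grid: [row[::step] for row in grid[::step]]


def format_text(grid):
    """Stand-in for a format service: render the grid as plain text."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


pipeline = chain(geo_crop(0, 4, 0, 4), resample(2), format_text)
```

Each stage only needs to agree with its neighbors on the data passed between them, which is why stages written in different languages can be combined into one processing solution.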
    • 24. Data Integration using Web Mapping Services
      [Diagram labels: Globe AMSU-A Knowledge Base ITSC Coastlines Countries MCS Events Cyclone Events AMSU-A Channel 01]
      AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.
    • 25. Fused Displays from Multiple Servers
      Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
    • 26. Concept Hierarchy for Data Mining and Fusion
      [Diagram labels: Model and Observation Data; FEATURE I; FEATURE II; FEATURE III; FEATURE SET I; EVENT A; FEATURE X; FEATURE Y; EVENT B; CONCEPTUAL LEVEL; CONCEPT MINING; DATA FILE LEVEL; DECISION SUPPORT; MULTI-LEVEL MINING]
    • 27. On-Board Real-Time Processing Sensor Control/Targeting
      • Anomaly detection
      • Data Mining
      • Autonomous Decision Making
      • Immediate response
      • Direct satellite to Earth delivery of results
      EVE – Environment for On-board Processing www.itsc.uah.edu/eve
    • 28. A Reconfigurable Web of Interacting Sensors
      [Diagram labels: Ground Network (x3); Military Weather Satellite Constellations Communications]
    • 29. Example Plan: Threshold events in AMSU-A Streaming Data EVE
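A threshold-event plan over streaming data can be sketched as a Python generator. This is an illustration only, not EVE code: the function name, the event tuple layout and the strict "greater than" comparison are assumptions.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak) for each run above threshold.

    Samples are consumed one at a time, so the function works on an
    unbounded stream and keeps only the state of the current event.
    """
    start = peak = None
    last = -1
    for i, value in enumerate(stream):
        last = i
        if value > threshold:
            if start is None:
                start, peak = i, value  # a new event begins
            else:
                peak = max(peak, value)
        elif start is not None:
            yield (start, i - 1, peak)  # the event just ended
            start = None
    if start is not None:
        yield (start, last, peak)  # stream ended during an event
```

Only the compact event tuples need to leave the processing node, which suits the constrained on-board environment described earlier in the talk.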
    • 30. Data Integration and Mining: From Global Information to Local Knowledge
      • Precision Agriculture
      • Emergency Response
      • Weather Prediction
      • Urban Environments
    • 31. Key Questions:
      • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
      • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
      • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
      • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
