Challenges for next-gen data mining


  1. Focus Study: Mining on the Grid with ADaM
     Sara Graves, Sandra Redman
     Information Technology and Systems Center and Information Technology Research Center
     University of Alabama in Huntsville
     National Space Science and Technology Center
     256-961-7806 [email_address] [email_address] www.itsc.uah.edu
  2. Data Mining
       • Automated discovery of patterns and anomalies in vast observational data sets
       • Derived knowledge for decision making, predictions and disaster response
       • http://datamining.itsc.uah.edu
  3. Creating a Successful Environment for Data Mining
       • Provide scientists with capabilities that allow flexible, creative scientific analysis
       • Provide the data mining benefits of
           • Automation of the analysis process
           • Reducing data volume
       • Provide a framework that gives the entire process a well-defined structure
       • Provide a suite of mining algorithms for creative analysis that can adapt to new hypotheses
       • Provide capabilities to add science algorithms to the environment
       • Exploit emerging technologies in computational and data grids, high-performance networks, and collaborative environments
  4. Challenges for Next-generation Mining
       • Develop and document common/standard interfaces for interoperability of data and services
       • Design new data models for handling
           • real-time/streaming input
           • data fusion/integration
       • Design and develop distributed standardized catalog capabilities
       • Develop advanced resource allocation and load balancing techniques
       • Exploit the grid concept for enhanced data mining functionality
       • Develop more intelligent and intuitive user interfaces
       • Integrate with collaborative environments
       • Develop ontologies of scientific data, processes and data mining techniques for multiple domains
       • Support language and system independent components
       • Incorporate data mining into science and engineering curricula
  5. Algorithm Development and Mining System (ADaM) - System Overview
       • Consists of over 100 interoperable mining and image processing components
       • Each component is provided with a C++ application programming interface (API) and an executable in support of scripting tools (e.g. Perl, Python, Tcl, Shell)
       • ADaM components are lightweight and autonomous, and have been used successfully in a grid environment (NASA IPG, TeraGrid, lab)
       • ADaM has several translation components that provide data level interoperability with other mining systems (such as WEKA and Orange) and point tools (such as libSVM and svmLight)
       • Web service interfaces in development
       • Executes in multiple environments (e.g. workstation, cluster, grid, on-board)
       • NMI Integration Testbed test cases
  6. MEAD: Modeling Environment for Atmospheric Discovery
       • One of the NSF PACI Alliance research Expeditions
       • Expeditions ensure intense collaboration among technology developers and application scientists, and focus on deploying infrastructure that supports computational science and engineering in a variety of disciplines
       • MEAD's focus is on retrospective analysis of hurricanes and severe storms using the TeraGrid, integrating computation, grid workflow management, data management, model coupling, data analysis/mining, and visualization
  7. MEAD Mining Example: Mesocyclone Detection Algorithm
       • Science Objective:
           • To investigate different thunderstorm cell interactions favorable for subsequent tornado (mesocyclone) formation
       • Goals:
           • Develop a mesocyclone detection algorithm (in both 2D and 3D)
           • Develop an algorithm to track the temporal evolution of the mesocyclone features
           • Investigate the use of clustering techniques to:
               • Summarize differences in simulation runs
               • Provide an overview of all the simulations
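The tracking goal on this slide can be illustrated with a minimal sketch. It assumes each detected feature is reduced to an (x, y) centroid and links features between consecutive time steps by nearest centroid. This linking rule, the `track_features` name, and the `max_dist` threshold are illustrative assumptions; the slides do not describe how the MEAD tracking algorithm actually works.

```python
import math

def track_features(frames, max_dist=2.0):
    """Link detected features across time steps by nearest centroid.

    frames: one list of (x, y) centroids per time step.
    Returns a list of tracks; each track is a list of (step, (x, y)).
    """
    finished = []
    active = []
    for step, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            last_pos = track[-1][1]
            # Closest detection not yet claimed by another track, if any.
            best = min(unmatched, key=lambda p: math.dist(last_pos, p), default=None)
            if best is not None and math.dist(last_pos, best) <= max_dist:
                track.append((step, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                # No nearby detection: the track ends here.
                finished.append(track)
        # Every unclaimed detection starts a new track.
        still_active.extend([[(step, p)] for p in unmatched])
        active = still_active
    return finished + active

# Hypothetical centroids for three time steps.
frames = [
    [(0.0, 0.0)],
    [(1.0, 0.5), (10.0, 10.0)],
    [(2.0, 1.0)],
]
tracks = track_features(frames)
```

The greedy matching is order dependent, so a real tracker would likely need a more careful assignment step when features are close together.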
  8. Approach
       • Mining Approach
           • Use idealized WRF model simulations with different initial conditions
           • Create a large parameter space of thunderstorm cell interaction and storm behavior
           • Mine this search space for patterns and trends
       • Grid Approach
           • Application scripts developed in Python and tested on Linux; modified for Globus environment by writing a simple Globus RSL file
           • Application scripts constructed to run each combination of tools in parallel on a different node on the grid
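As a rough sketch of the grid approach, the snippet below enumerates every combination of tools and simulation runs and builds one RSL job description string per combination, so that each could be submitted to a different node. The tool names, run identifiers, script name `mine.py`, and interpreter path are hypothetical placeholders, not the actual MEAD scripts, and job submission itself is left out.

```python
import itertools

def make_rsl(executable, arguments, count=1):
    """Build a simple RSL job description string for one job."""
    args = " ".join('"{}"'.format(a) for a in arguments)
    return '&(executable="{}")(arguments={})(count={})'.format(executable, args, count)

# Hypothetical tool names and run identifiers, for illustration only.
detectors = ["detect2d", "detect3d"]
clusterers = ["kmeans", "dbscan"]
runs = ["run001", "run002"]

# One job description per (detector, clusterer, run) combination.
jobs = [
    make_rsl("/usr/bin/python", ["mine.py", d, c, r])
    for d, c, r in itertools.product(detectors, clusterers, runs)
]

for rsl in jobs:
    print(rsl)
```

Generating the descriptions from the Cartesian product keeps the parameter sweep in one place, so adding a tool or a run extends the job list without further changes.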
  9. Example MEAD Workflow
     [Workflow diagram. Stages: Initial Setup, Model Execution, Post Run Analysis. Elements: Initial Data and Parameters; Multiple WRF Models (Weather); Multiple ROMS Models (Ocean); Inter-model communications; Model Results; Data Mining (ADaM); Visualization]
     Grid environment supports the demanding computational, data storage and post analysis requirements
  10. Using the TeraGrid
       • Excellent user documentation at http://www.teragrid.org/userinfo/
       • Account Management – Procedures vary per site
           • Get account at each site
           • Obtain certificate (from one of several sites, X.509 or KX.509)
           • Establish Distinguished Name in grid-mapfile at each site
           • Create certificate proxy (grid-proxy-init, MyProxy, kinit)
       • Programming Environment – Know your systems
           • Compilers (you have a number of choices)
           • Environment Variables (SoftEnv)
           • Message Passing (several flavors available)
       • Executing Jobs
           • Condor-G
           • Globus
  11. WRF Initializations
       • 230 WRF runs were made, plus two control (single-cell) runs
       • Each corresponded to a particular arrangement of a pair of initial storm cells
       • In figure at left:
           • Each square: 1 simulation
           • 1st storm in the middle
           • 2nd at one of blue squares
           • Center cell stronger
     [Figure: Matrix of WRF simulations. Slide source: Brian Jewett]
  12. Example: Tracking Results
  13. Mesocyclone Detection and Tracking Results
       • Features with time durations of a single time step are filtered out
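The filtering rule on this slide can be sketched as follows, assuming each tracked feature is stored as a list of (time step, position) pairs. That representation and the `filter_short_tracks` name are assumptions made for illustration, not the format used by the MEAD tools.

```python
def filter_short_tracks(tracks, min_steps=2):
    """Keep only tracks that span at least min_steps distinct time steps."""
    return [t for t in tracks if len({step for step, _ in t}) >= min_steps]

# Hypothetical tracks: one lasting three time steps, one seen only once.
tracks = [
    [(0, (0.0, 0.0)), (1, (1.0, 0.5)), (2, (2.0, 1.0))],
    [(1, (10.0, 10.0))],
]
kept = filter_short_tracks(tracks)
```

With the default `min_steps=2`, any feature that appears in only a single time step is dropped, which matches the rule stated on the slide.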
  14. Summary – Mesocyclone Detection
       • Larger numbers of long-duration mesocyclones tend to be associated with initializations where the second cell is closer to the first
       • Mesocyclones found in the storm simulations are sensitive to the particular arrangement of a pair of initial storm cells (secondary storm placement at 45 degrees to the primary storm)
       • Clustering techniques are useful to
           • Summarize differences in simulation runs
           • Provide an overview of all the simulations
       • Limitations of clustering algorithms
           • Investigated K-Means, DBSCAN, Maximin and Hierarchical clustering algorithms
           • K-Means gives inferior clustering quality but provides useful cluster centers, or profiles
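To show what "cluster centers or profiles" means in practice, here is a minimal pure-Python K-Means sketch. The two-dimensional feature vectors are made-up placeholders, not results from the WRF simulations, and the deterministic initialization (first k points) is a simplification chosen to keep the example reproducible.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means: returns (centers, labels) for a list of equal-length tuples."""
    centers = [tuple(p) for p in points[:k]]  # simple deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Hypothetical per-simulation summary vectors, for illustration only.
profiles = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2)]
centers, labels = kmeans(profiles, k=2)
```

Each returned center is the mean of the simulations assigned to it, which is why a center can be read as a representative profile for its group of runs.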
  15. LEAD: Linked Environments for Atmospheric Discovery
       • A cyberinfrastructure for mesoscale meteorology
           • Real-time, on-demand, and dynamically adaptive needs for mesoscale weather research
           • High volume data sets and streams
           • Computationally demanding numerical models and data assimilation systems
  16. LEAD
       • NSF Information Technology Research (ITR) program
       • Multi-disciplinary team contributing expertise in meteorological applications, analysis tools, forecast tools, data distribution and management, portal development, workflow orchestration, education and outreach
  17. LEAD
       • An integrated framework for identifying, accessing, preparing, assimilating, predicting, managing, analyzing, mining, and visualizing meteorological data, independent of format and physical location
       • Dynamic workflow orchestration and data management are key elements
  18. LEAD GWSTBs: Grid and Web Services Testbeds
       • Local User Environment – customized portal, control of information flows, collaboration tools, managing processes
       • Productivity Environment – models, tools, and algorithms
       • Data Services Environment – data transport, data formatting, and interoperability
       • Distributed Technologies Environment – workflow infrastructure to autonomously acquire resources and adapt to changing plans
       • Data Archive – recent and historical data, products, and tools
  19. The Portal as a Grid Access Point
       • The Portal Server provides the user's Grid context.
     [Architecture diagram. Labels: OGCE or GridSphere Grid Portal Server; https; SOAP & WS-Security; Open Grid Service Architecture Layer (Security; Data Management Service; Accounting Service; Logging; Event Service; Policy; Administration & Monitoring; Grid Orchestration; Registries and Name binding; Reservations and Scheduling); Web Services Resource Framework – Web Services Notification; Physical Resource Layer]
  20. Services Oriented Architecture
       • User interfaces with portal via browser
       • Portal provides tools for users to build and launch workflows
       • Portlets (JSR-168) provide interface between user and grid services
       • Applications can be wrapped as services via a Portal Factory Service Generator
           • Requires application, script to run it, input parameters, output parameters
           • Write an AppService document and upload to Portal Factory Service Generator (in portal)
           • Service is created as well as the portal client interface
       • Security model integral to design
  21. Data Integration and Mining: From Global Information to Local Knowledge
       • Precision Agriculture
       • Emergency Response
       • Weather Prediction
       • Urban Environments
       • Bioinformatics
