Rapid Prototyping Capability for Earth-Sun System Sciences: Preliminary Design
Robert J. Moorhead
Mississippi State University
Formulate architectures and develop baseline capacities that integrate applied sciences systems tools into configurations to support efficient evaluation of the prospects of integrating research results from NASA’s Earth observation systems (with emphasis on spacecraft instruments on missions recently launched or planned for near-term launch) and associated Earth system models
systems engineering tools
enterprise architecture tools
information visualization and analysis tools
uncertainty characterization tools
performance assessment tools
“NASA Earth Science and Space Systems benefiting Society: Evolving Systems Engineering Capacity,” presentation by Ron Birk, August 24, 2005, SSC
Reduce the amount of time that has typically been required to consider the utility of new or future data streams on model outcomes.
Systematically evaluate research capabilities in a simulated operational environment in order to evaluate components and/or configurations that could be considered for verification, validation, and benchmarking for transition from research to operations and/or into an integrated system solution (ISS).
Figure 1 illustrates the interface between the RPC and external systems that include the SN and ISS components of NASA’s Earth Science Application Plan.
The RPC will provide the capability to integrate and provide access to the tools needed to evaluate the use of a wide variety of current and future NASA sensors and research results, model outputs, and knowledge, collectively referred to as “resources”.
It is assumed that the resources are geographically distributed and thus RPC will provide the support for the location transparency of the resources.
Figure 1 labels: RPC node; local and remote computing and storage facilities; remote data providers; model configuration; input data sets configuration; experiment design and execution; analysis; system administration and maintenance.
System modes and states
Before an experiment can be performed (a particular model using a particular data source) two conditions must be satisfied.
First, the model must be installed at a computing facility accessible to RPC users and configured to run.
Second, the data must be configured so that it can be used by the model. The data configuration may involve developing tools for the data conversions (format translations, subsetting, deriving values of variables not included in the original data products, geo-processing, etc.).
From the point of view of performing a particular experiment and analysis, the RPC can be in two distinct states:
ready for the experiment and analysis by end users
requiring action of specialists for installing and configuring the model and its data
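The two conditions and the two resulting states can be sketched as a simple check. This is a minimal illustration under stated assumptions, not part of the RPC design: the names `RpcState`, `Experiment`, and `experiment_state` are hypothetical, and the sets of installed models and configured data stand in for whatever catalogs the RPC node would actually keep.

```python
from dataclasses import dataclass
from enum import Enum


class RpcState(Enum):
    """The two states named on the slide (labels are illustrative)."""
    READY = "ready for the experiment and analysis by end users"
    NEEDS_SPECIALIST = "requires specialist action on the model or its data"


@dataclass(frozen=True)
class Experiment:
    """A particular model paired with a particular data source (hypothetical type)."""
    model: str
    data_source: str


def experiment_state(exp, installed_models, configured_data):
    """Return the RPC state for one experiment.

    installed_models: set of model names installed and configured to run.
    configured_data: set of (model, data_source) pairs whose data has been
        converted to match that model's input requirements.
    """
    model_ok = exp.model in installed_models
    data_ok = (exp.model, exp.data_source) in configured_data
    if model_ok and data_ok:
        return RpcState.READY
    return RpcState.NEEDS_SPECIALIST
```

For example, with `installed_models = {"WRF"}` and `configured_data = {("WRF", "existing_product")}`, an experiment pairing WRF with a not-yet-configured product evaluates to `RpcState.NEEDS_SPECIALIST`.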
During its life cycle, new resources and tools will be integrated with the RPC node, increasing the repertoire of experiments and analyses that can be performed.
Figure: Major Categories of Experiments. Panel "Different sources": one numerical model, several sets of model results, analysis. Panel "Different models": numerical model 1 and numerical model 2, model results for each, analysis.
Discovery, semantic understanding, secure access, and transport mechanisms for data products available from known data providers (Science Data Manager)
Data assimilation and geo-processing tools for all data transformations needed to match a given data product (or products) to the model input requirements, and support for organizing the data processing into workflows built from reusable and interoperable modules, including both the workflow specification mechanisms and the workflow enacting engine (Interoperable Geo-processing Environment)
Capabilities Required (cont.)
Catalog of available models, model metadata catalog (including input and output model requirements), and mechanisms for integrating new models with RPC
Mechanisms for creating runtime environments; data staging (in and out); job scheduling, remote execution, and monitoring
Mechanisms for storing model outputs together with metadata and provenance information (all information needed to recreate the output data set); the metadata necessary to enable search and discovery of model outputs
Tools for model output analysis (including visualizations), tools for quantitatively comparing model outputs, and tools for model benchmarking (Performance Metrics Workbench)
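As an illustration of what "quantitatively comparing model outputs" could mean in practice, the sketch below computes two common error measures (bias and root-mean-square error) for a baseline run and a candidate run against a shared reference series. The slides do not specify which metrics the Performance Metrics Workbench uses; the function names and the choice of metrics here are assumptions.

```python
import math


def _check(model, reference):
    """Both series must be non-empty lists of equal length."""
    if len(model) == 0 or len(model) != len(reference):
        raise ValueError("series must be non-empty and of equal length")


def bias(model, reference):
    """Mean difference, model minus reference."""
    _check(model, reference)
    return sum(m - r for m, r in zip(model, reference)) / len(model)


def rmse(model, reference):
    """Root-mean-square error of model against reference."""
    _check(model, reference)
    squared = sum((m - r) ** 2 for m, r in zip(model, reference))
    return math.sqrt(squared / len(model))


def compare_runs(baseline, candidate, reference):
    """Summarize two model runs against the same reference.

    A negative 'rmse_change' means the candidate run (for example, one
    driven by a new data stream) is closer to the reference than the baseline.
    """
    base_rmse = rmse(baseline, reference)
    cand_rmse = rmse(candidate, reference)
    return {
        "baseline": {"bias": bias(baseline, reference), "rmse": base_rmse},
        "candidate": {"bias": bias(candidate, reference), "rmse": cand_rmse},
        "rmse_change": cand_rmse - base_rmse,
    }
```

Comparing both runs against one reference keeps the two categories of experiment (different sources, different models) on the same scale.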
Major System Constraints
Only models and data made available to RPC users and integrated with the RPC node can be used to perform experiments.
Installation and/or integration of models, as well as integration and geo-processing of data, needs to be performed by a respective specialist, and the time needed to accomplish that task will depend on the complexity of the particular model and data set(s).
Running a model may take a long time, depending on the complexity and configuration of the model. The experiments will not necessarily be performed in real time.
System administrators – responsible for deployment, configuration, and maintenance of the system, and its users (for access control purposes)
Application specialists – responsible for installation and configuration of the model on computational systems accessible to the RPC users, and integrating these models with the RPC (which includes definition of the input and output data requirements)
Data processing specialists – responsible for the development and the deployment of the tools for data transformations
Domain specialists – responsible for defining, configuring (creating workflows for data processing, setting model parameters, etc.), and executing experiments
Analysts – domain specialists performing the data analysis
Assumptions and Dependencies
The RPC will depend on data and models provided by third parties.
Access to remote computational and storage facilities will be controlled according to policies established by the facility owners (stakeholders).
It is assumed that these policies will allow RPC users to submit and monitor jobs on these systems, which may require penetrating firewalls.
It is possible that the access privileges will be different for different users, depending on organizational membership, nationality, or other factors beyond the control of the RPC system developers.
Operational Scenario Summary
Design of experiment – identification of models and data sets to be used
Assessment whether the models and data are currently integrated with the RPC node
Filing requests with model and data specialists, as needed; the specialists issue a notification when the models and data are available
Configuration of the experiment (setting the model parameters, configuring the data, e.g., ROI, timeframe, etc.)
Asynchronous run and monitoring of the model
The RPC node will be installed on a dedicated, stand-alone system consisting of standard commercially available computing nodes, data storage, and hosting middleware servers.
Core RPC modular capabilities (SDM, IGE, MM, PMW) will be executed on separate computing nodes.
The RPC node will be complemented with remote resources – high performance computing and storage facilities as needed by the models to be used in the experiments.
The RPC node can be moved from one geographical location to another.
Access to the remote resources will require standard internet connections.
System Performance Characteristics
The primary goal of the RPC node is to provide the capability to rapidly prototype the assimilation of new or future NASA data products and/or model derived data streams into model applications that have generated demonstrable scientific results of merit and stakeholder interest.
However, there is no established benchmark to quantitatively specify what “rapid” means. The reference point is the current practice – manual configuration of data and models, whereas the expectation is that the RPC approach will considerably speed up the process, in particular for repeated experiments, after the baseline data and models are set up.
However, the initial phase – setting the baseline data and models – may prove to be time consuming as it will involve model integration, data acquisition and simulation, and the development of new components for geoprocessing the data.
“Rapid Prototyping” performance benefits will be best realized through the reusability of configured geoprocessing tasks to provide model-ready input data to a model that has been fully integrated into the RPC.
It is this “reuse” capability that will enable the rapid evaluation of new data types.
By associating existing geoprocessing workflows with new data types, the rapid assimilation of next-generation data into configured models should be readily achievable.
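The reuse idea can be sketched as a registry that maps data types to geoprocessing workflows built from reusable steps. Everything in this sketch is hypothetical (the step functions, the record layout with `lat`, `lon`, and `value` keys, and the registry itself); it only illustrates how an existing workflow might be associated with a new data type without rebuilding it.

```python
def compose(*steps):
    """Chain geoprocessing steps into one workflow callable."""
    def workflow(records):
        for step in steps:
            records = step(records)
        return records
    return workflow


def subset_region(south, north, west, east):
    """Reusable step: keep records inside a latitude/longitude box."""
    def step(records):
        return [r for r in records
                if south <= r["lat"] <= north and west <= r["lon"] <= east]
    return step


def scale_values(factor):
    """Reusable step: convert units by a constant factor (copies each record)."""
    def step(records):
        return [{**r, "value": r["value"] * factor} for r in records]
    return step


# Hypothetical registry: data type name -> configured workflow.
WORKFLOWS = {}


def register_workflow(data_type, workflow):
    WORKFLOWS[data_type] = workflow


def prepare_input(data_type, records):
    """Produce model-ready input by running the workflow tied to data_type."""
    try:
        workflow = WORKFLOWS[data_type]
    except KeyError:
        raise LookupError("no workflow registered for %r" % data_type) from None
    return workflow(records)
```

Usage under these assumptions: build a workflow once with `compose(subset_region(...), scale_values(...))`, register it for an existing product, then call `register_workflow("new_product", same_workflow)` so that the new data type is processed by the already configured steps.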
Policy and Regulation
As the RPC develops into a viable simulation system, it is expected that activities requiring RPC resources will be requested and coordinated among those selecting an RPC for evaluation, the RPC team conducting a specific evaluation, and RPC developers who will be required to maintain and evolve the RPC to support requirements for integrating new model applications, data products, and geoprocessing tasks.
As the RPC evolves to meet new or changing requirements, configuration management practices, version control, and developmental practices will be followed to ensure that capabilities in development will be isolated from operational RPC capabilities.
Simply stated, development activities, testing, and integration of new functionalities into the RPC should be “contained” through the use of segregated physical or virtual systems that may be isolated from the operational instance of the RPC.
As new capabilities mature through development processes, configuration “check-in” procedures will be followed to ensure the orderly integration of the new “proven” capabilities.
It is likely that such activities will involve proactive participation of an RPC technical working group.
The RPC node has 5 categories of users, each requiring a dedicated interface.
In addition, the RPC interacts with two classes of external systems: data providers and remote computing and storage facilities.
Each interface is described in the remaining slides.
System Administrator Interface
The administrator interface must support the administrator tasks:
registering and de-registering users and assigning roles
maintaining the user credentials needed to access remote resources
monitoring the system status and usage
backing up and restoring data and software; recovery from faults
deployment of new software components and services
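The first task in the list (registering and de-registering users and assigning roles) can be sketched as follows. The five roles mirror the user categories named in these slides; the `UserRegistry` class and its method names are hypothetical, and credential handling for remote resources is deliberately left out.

```python
from enum import Enum


class Role(Enum):
    """The five user categories described in the slides."""
    SYSTEM_ADMINISTRATOR = "system administrator"
    MODEL_SPECIALIST = "model specialist"
    DATA_SPECIALIST = "data specialist"
    DOMAIN_SPECIALIST = "domain specialist"
    ANALYST = "analyst"


class UserRegistry:
    """In-memory sketch of user registration and role assignment."""

    def __init__(self):
        self._roles = {}  # user name -> set of Role

    def register(self, user, *roles):
        if user in self._roles:
            raise ValueError("user already registered: %s" % user)
        self._roles[user] = set(roles)

    def deregister(self, user):
        # Raises KeyError if the user is unknown.
        del self._roles[user]

    def assign(self, user, role):
        # Raises KeyError if the user is unknown.
        self._roles[user].add(role)

    def has_role(self, user, role):
        return role in self._roles.get(user, set())
```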
Model Specialist Interface
The model specialist is responsible for deploying and integrating the models into the RPC environment.
The models can be installed either locally on RPC node hardware and/or at a remote computing facility.
To integrate the model with RPC the specialist must “register” the model, that is, generate a metadata record that describes the model in terms of its functionality, the runtime requirements (location of the executable, environmental variables, the structure of the working directory, etc.), model parameters, and definition of the input and output datasets.
The model specialist interface must thus support the registration of new models and editing of the metadata of the existing models.
In addition, the model specialist interface must provide support for the testing of the correctness of the model deployment.
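A registration record of the kind described above might look like the following sketch. The field names follow the items listed on the slide (functionality, executable location, environment variables, working directory structure, parameters, inputs, outputs), but the `ModelRecord` and `ModelCatalog` classes and the validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Metadata generated when a model is registered (hypothetical layout)."""
    name: str
    functionality: str                      # what the model does
    executable: str                         # location of the executable
    environment: dict = field(default_factory=dict)         # environment variables
    working_dir_layout: list = field(default_factory=list)  # expected subdirectories
    parameters: dict = field(default_factory=dict)          # parameter name -> default
    inputs: list = field(default_factory=list)              # input dataset definitions
    outputs: list = field(default_factory=list)             # output dataset definitions


def validate(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    if not record.name:
        problems.append("name is required")
    if not record.executable:
        problems.append("executable location is required")
    if not record.inputs:
        problems.append("at least one input dataset must be defined")
    if not record.outputs:
        problems.append("at least one output dataset must be defined")
    return problems


class ModelCatalog:
    """Supports registering new models and editing existing metadata."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        problems = validate(record)
        if problems:
            raise ValueError("; ".join(problems))
        if record.name in self._records:
            raise ValueError("model already registered: %s" % record.name)
        self._records[record.name] = record

    def update(self, record):
        if record.name not in self._records:
            raise KeyError(record.name)
        problems = validate(record)
        if problems:
            raise ValueError("; ".join(problems))
        self._records[record.name] = record

    def get(self, name):
        return self._records[name]
```

Validating at registration time is one simple way to support the "testing of the correctness of the model deployment" requirement at the metadata level; checking that the executable actually runs would need a separate test job.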
Data Specialist Interface
The data specialist identifies the data providers and designs the geo-processing procedure for transforming the original data product to match the model input data requirements.
The design of the geo-processing may require the development and deployment of software components to perform specified tasks.
The data specialist interface must provide support for:
searching data products from known data providers
assessing the structure and syntax of available data products
assessing the model input data requirements
discovering and evaluating the geo-processing modules already integrated with the RPC node
integrating new geo-processing modules within the RPC node
composing the geo-processing process from available components
testing of the correctness of the geo-processing procedure
Domain Specialist Interface
To support the design and execution of experiments, the domain specialist interface must support:
Discovery of available models and data through the RPC facilities
Receiving and filling requests for new models and data
Configuring experiments by
Connecting a particular model with particular data
Setting the model parameters
Configuring datasets (region of interest, timeframe, etc.)
Submitting models for execution
Monitoring the model progress
Controlling the model execution (e.g., aborting it, if needed)
Verifying that the model completed successfully (e.g., by examining a log file generated by the model, running a test application, etc.)
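The submit, monitor, abort, and verify steps at the end of this list can be sketched with a background thread standing in for remote execution. This is an assumption-laden illustration: `ModelJob`, `JobStatus`, and the step-based progress measure are hypothetical, and a real RPC node would delegate to a job scheduler on a remote facility rather than a local thread.

```python
import threading
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    ABORTED = "aborted"
    FAILED = "failed"


class ModelJob:
    """Runs a model as n_steps calls to run_step(i) in a background thread."""

    def __init__(self, run_step, n_steps):
        if n_steps < 1:
            raise ValueError("n_steps must be at least 1")
        self._run_step = run_step
        self._n_steps = n_steps
        self._abort = threading.Event()
        self._lock = threading.Lock()
        self._steps_done = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.status = JobStatus.PENDING
        self.log = []  # stands in for the log file generated by the model

    def submit(self):
        """Start asynchronous execution and return the job handle."""
        self.status = JobStatus.RUNNING
        self._thread.start()
        return self

    def _run(self):
        try:
            for i in range(self._n_steps):
                if self._abort.is_set():
                    self.log.append("aborted before step %d" % i)
                    self.status = JobStatus.ABORTED
                    return
                self._run_step(i)
                with self._lock:
                    self._steps_done = i + 1
            self.log.append("completed")
            self.status = JobStatus.COMPLETED
        except Exception as exc:
            self.log.append("failed: %s" % exc)
            self.status = JobStatus.FAILED

    def progress(self):
        """Fraction of steps finished, between 0.0 and 1.0."""
        with self._lock:
            return self._steps_done / self._n_steps

    def abort(self):
        """Request a stop; takes effect before the next step starts."""
        self._abort.set()

    def wait(self, timeout=None):
        """Block until the job ends (or timeout) and return its status."""
        self._thread.join(timeout)
        return self.status
```

A caller would typically poll `progress()` while the job runs and, after `wait()`, check both `status` and the last entry in `log` to verify that the model completed successfully.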
The analyst analyses the experiment outcome. The analyst interface must:
Allow queries of the output data databases to find the model outputs of interest
Provide access to model outputs
Provide access to model provenance (when and under what circumstances the model was run, e.g., what input data sets were used, the values of the model parameters, etc.)
Provide access to tools (visualizations or otherwise) enabling access to the results of the experiments
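A provenance record that supports the queries above could be sketched as a small dictionary stored next to each model output. The field names, the SHA-256 configuration fingerprint, and the `find_outputs` helper are assumptions for illustration; the slides only require that enough information be kept to recreate the output data set and to find outputs of interest. Parameters are assumed to be JSON-serializable.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(model, model_version, parameters, input_datasets, run_time=None):
    """Build a provenance record for one model run.

    The fingerprint covers the model, its version, parameters, and inputs,
    but not the run time, so two runs of the same configuration share it.
    """
    if run_time is None:
        run_time = datetime.now(timezone.utc)
    record = {
        "model": model,
        "model_version": model_version,
        "parameters": dict(parameters),
        "input_datasets": sorted(input_datasets),
    }
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["config_fingerprint"] = hashlib.sha256(canonical).hexdigest()
    record["run_time_utc"] = run_time.isoformat()
    return record


def find_outputs(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]
```

Keeping the run time outside the fingerprint lets an analyst ask "has this exact configuration been run before?" with a single equality query.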
Data Provider Interface
The RPC must define interfaces that allow acceptance of data streams coming from data providers.
Remote Resources Interface
The RPC must define interfaces for invoking Grid services such as allocating and monitoring remote resources, accepting notifications about status changes (e.g., a job has completed), and transferring data between the RPC node and remote resources, as well as between remote resources.
Defined interfaces must support delegation of user credentials to satisfy the access control requirements and policies of the remote resources.
The End
Backup slides follow
The baseline system. This four-tier architecture follows OGSA recommendations.
Figure: Functional computational capabilities of the RPC system. Labels: evaluations leading to new understanding and ideas for ISS; MyRPC; LIS; IGE.
Grid-enabled OGC Services
Figure: Service-oriented architecture for the Computational RPC Node [based on NSF LEAD (Droegemeier et al., 2006)]. Labels: RPC Portal; MyRPC; GCMD; WRF, HSPF; LIS, RAMS; DAACs; CLASS; Evaluation; ESMF, GEOLEM; OGC Services.
Figure: Systems framework for CRPN, consisting of interacting subsystems in the secure and stable RPC computational grid [based on NSF LEAD (Droegemeier et al., 2006)]. Labels: CRPN; WRF; ESMF; IGE; GCMD; MyRPC workspace; LIS; WorldWinds.