Gateway Optimal Resource Selection and Integrated Information Services Framework
Xuan Wu, Deivasigamani Suresh Kumar, Raminder Singh, Suresh Marru, Marlon Pierce
Pervasive Technology Institute, Indiana University


Goals
Various TeraGrid information and monitoring services provide valuable information, but they are scattered among TeraGrid Information Services (TGIS), the TeraGrid news maintenance feed, INCA resource monitoring, Karnak system status and start time and wait time predictions, and Speedpage file transfer monitoring and estimates. TGIS provides a one-stop shop for these services, but the data remains discrete.
Gateways would like to get the information they need with a single query. Motivated by the needs of the LEAD, UltraScan and GridChem gateways, we developed the Optimal Resource Prediction Service (ORPS). The goals of ORPS are:
• Integrate all information sources into a single yes-or-no answer. "Is the resource healthy?" should verify that the machine is not in maintenance and that at least one file transfer service and a job management service are up and functional.
• Predict which resources will be optimal for running the next compute job, based on Karnak start time prediction, pre-determined application performance and estimated run time.

Architecture
[Figure: ORPS architecture diagram]

Usage Scenario
One-Time Registration:
• Step A: Register the community and research applications used by the gateway, along with any available performance data.
• Step B: Register all gateway resources and the grid middleware they use, such as GRAM5 and GridFTP.
Example Queries:
1. Given a job configuration (number of processors, wall time), return all healthy resources sorted by earliest start time.
2. Gateway-Specific Resource Summary: list the status of all resources used by a gateway (UltraScan, GridChem), including:
   • Status of Job Management, File Transfer, and GSISSH and login nodes.
   • Overall health: GOOD if all three services above are healthy; BAD if any one of them is down; UNKNOWN if any test result returns unknown; or SCHEDULED MAINTENANCE.
   • Current load: number of waiting/running jobs and usage.

Hosted Service or Download & Deploy
Hosted Service Example:
http://ogceportal.iu.teragrid.org:19444/orps-service/XML/gateway/$(gatewayId)
Download, build and deploy:
1. svn co https://ogce.svn.sourceforge.net/svnroot/ogce/incubator/ORPS
2. mvn clean install
3. Configure data collection schedules, the database for caching, and ports
4. start.sh
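The overall-health rule and the first example query can be illustrated with a minimal Python sketch. It is not the ORPS implementation (ORPS itself is written in Java): the `ResourceStatus` record, its field names, and the function names are hypothetical, and the precedence among the four states (maintenance first, then BAD, then UNKNOWN) is an assumption, since the poster does not specify an order.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Health(Enum):
    GOOD = "GOOD"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"
    SCHEDULED_MAINTENANCE = "SCHEDULED MAINTENANCE"


@dataclass
class ResourceStatus:
    # Hypothetical record; the real ORPS schema is not shown on the poster.
    name: str
    in_maintenance: bool
    # Each service test: True = healthy, False = down, None = unknown result.
    job_management: Optional[bool]
    file_transfer: Optional[bool]
    login: Optional[bool]
    start_wait_s: float  # predicted wait before the job starts, in seconds


def overall_health(r: ResourceStatus) -> Health:
    """Combine the three service tests into one state (assumed precedence)."""
    if r.in_maintenance:
        return Health.SCHEDULED_MAINTENANCE
    tests = (r.job_management, r.file_transfer, r.login)
    if any(t is False for t in tests):
        return Health.BAD
    if any(t is None for t in tests):
        return Health.UNKNOWN
    return Health.GOOD


def healthy_by_start_time(resources: List[ResourceStatus]) -> List[ResourceStatus]:
    """Example query 1: keep GOOD resources, earliest predicted start first."""
    healthy = [r for r in resources if overall_health(r) is Health.GOOD]
    return sorted(healthy, key=lambda r: r.start_wait_s)


resources = [
    ResourceStatus("alpha", False, True, True, True, 1800.0),
    ResourceStatus("beta", False, True, False, True, 60.0),    # file transfer down
    ResourceStatus("gamma", True, True, True, True, 10.0),     # in maintenance
    ResourceStatus("delta", False, True, True, True, 300.0),
    ResourceStatus("epsilon", False, True, None, True, 5.0),   # unknown test result
]
print([r.name for r in healthy_by_start_time(resources)])  # ['delta', 'alpha']
```

In this sketch the predicted wait is stored on the record for brevity; in practice it would depend on the job configuration (number of processors, wall time) passed with the query.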




Salient Features
• ORPS is a flexible and extensible architecture developed in Java on the Spring MVC framework. The framework adapts to emerging information sources.
• External information services deliver information through subscriptions (push) or through periodic polls.
• The scheduler polls each source according to its update frequency, and data is provided downstream in near real time.
• ORPS exposes both the raw and the mashed-up information to gateways through REST interfaces.
• A multi-level database cache, aware of each information source's update schedule, serves surges of job submission requests from gateways.
• The determined health and schedule are cached at the second level to ensure a quick response time (< 100 ms).
• Detailed test failures are provided to help distinguish transient from persistent failures.

Status & Future Work
Phase I (completed):
• ORPS is currently integrated into the UltraScan production gateway.
• Working with GridChem gateway developers to integrate ORPS into their development environment.
Phase II (in development):
• Application-specific scheduling: get all healthy resources to run Gaussian on TeraGrid. Selection is based on queue wait time + Gaussian relative performance data + bandwidth estimates.

Acknowledgement
The authors would like to thank the INCA, TGIS and Karnak teams for valuable discussions and support, and the UltraScan and GridChem gateways for requirements and integration.
This work is partially supported by the TeraGrid Gateway Advanced Support Activity and the Open Gateway Computing Environments NSF SDCI Grant No. OCI-1032742.
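The schedule-aware, two-level caching listed under Salient Features might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the class and method names, the in-memory dictionaries standing in for the database cache, and the polling intervals (600 s and 60 s) are hypothetical and not taken from ORPS. The first level holds raw data per source and is refreshed only when that source's interval has elapsed; the second level holds the derived answer and is recomputed only after a first-level change, so reads stay cheap.

```python
import time
from typing import Any, Callable, Dict, Tuple

Fetch = Callable[[], Any]


class ScheduleAwareCache:
    def __init__(self, derive: Callable[[Dict[str, Any]], Any],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._derive = derive
        self._clock = clock
        self._sources: Dict[str, Tuple[Fetch, float]] = {}
        self._raw: Dict[str, Any] = {}            # first level: raw data per source
        self._fetched_at: Dict[str, float] = {}
        self._derived: Any = None                 # second level: derived answer
        self._stale = True

    def register(self, name: str, fetch: Fetch, interval_s: float) -> None:
        self._sources[name] = (fetch, interval_s)

    def poll_due(self) -> None:
        """Called on each scheduler tick; fetches only the sources that are due."""
        now = self._clock()
        for name, (fetch, interval_s) in self._sources.items():
            last = self._fetched_at.get(name)
            if last is None or now - last >= interval_s:
                self._raw[name] = fetch()
                self._fetched_at[name] = now
                self._stale = True

    def get(self) -> Any:
        """Serve the second-level value, recomputing only after raw data changed."""
        if self._stale:
            self._derived = self._derive(dict(self._raw))
            self._stale = False
        return self._derived


class ManualClock:
    """Test clock so the example does not need to sleep."""
    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = ManualClock()
fetch_counts = {"inca": 0, "karnak": 0}


def fetch_inca() -> Dict[str, str]:
    fetch_counts["inca"] += 1
    return {"alpha": "up"}


def fetch_karnak() -> Dict[str, float]:
    fetch_counts["karnak"] += 1
    return {"alpha": 120.0 * fetch_counts["karnak"]}


cache = ScheduleAwareCache(
    derive=lambda raw: (raw["inca"]["alpha"], raw["karnak"]["alpha"]),
    clock=clock,
)
cache.register("inca", fetch_inca, interval_s=600.0)    # slow-changing source
cache.register("karnak", fetch_karnak, interval_s=60.0)  # fast-changing source

cache.poll_due()            # first tick: both sources are fetched
first = cache.get()         # ("up", 120.0)
clock.now = 90.0
cache.poll_due()            # only the 60 s source is due again
second = cache.get()        # ("up", 240.0)
```

A production service would run `poll_due` from a background scheduler and also accept pushed updates from subscriptions; the sketch drives it by hand to keep the example deterministic.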

TG11 ORPS Poster


Editor's Notes

  • The framework provides application registry capabilities for registering the resources and applications used by a gateway. Application performance models can be plugged in to update performance data for a specific host. Once registered, the gateway can query for real-time status information, and the framework reports a status determined by checking that the required File Transfer and Job Management interfaces are healthy. As a first-order filter, resources in maintenance, faulty job managers and overwhelmed GridFTP servers are eliminated from scheduling. Further marshaling of the Karnak and Speedpage job queue and file transfer information increases gateway job success rates and improves turnaround times.
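The two-stage selection described in this note (filter out unhealthy resources, then combine queue and file transfer information) can be sketched in Python as follows. The poster names the inputs but not the formula, so the scoring here is an assumption: estimated turnaround is taken as queue wait plus a reference run time divided by a relative speed factor, plus input size divided by bandwidth. The `Candidate` fields, function names and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    # Hypothetical per-resource inputs, not the ORPS data model.
    name: str
    healthy: bool           # result of the first-order health filter
    queue_wait_s: float     # predicted queue wait (e.g. from Karnak)
    relative_speed: float   # > 1 means faster than the reference host; must be > 0
    bandwidth_mb_s: float   # estimated transfer bandwidth; must be > 0


def estimated_turnaround_s(c: Candidate, reference_runtime_s: float,
                           input_mb: float) -> float:
    """Assumed model: wait + scaled run time + input staging time."""
    run_s = reference_runtime_s / c.relative_speed
    transfer_s = input_mb / c.bandwidth_mb_s
    return c.queue_wait_s + run_s + transfer_s


def rank(candidates: List[Candidate], reference_runtime_s: float,
         input_mb: float) -> List[Tuple[str, float]]:
    """Drop unhealthy resources, then order by estimated turnaround."""
    scored = [
        (c.name, estimated_turnaround_s(c, reference_runtime_s, input_mb))
        for c in candidates
        if c.healthy
    ]
    return sorted(scored, key=lambda pair: pair[1])


candidates = [
    Candidate("alpha", True, 600.0, 2.0, 100.0),
    Candidate("beta", True, 60.0, 1.0, 50.0),
    Candidate("gamma", False, 0.0, 4.0, 200.0),  # eliminated: not healthy
]
ranking = rank(candidates, reference_runtime_s=3600.0, input_mb=1000.0)
print(ranking)  # [('alpha', 2410.0), ('beta', 3680.0)]
```

Note that "beta" has the shortest queue wait but still ranks second, which is the kind of case where combining wait time with performance and bandwidth data changes the choice.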