TG11 ORPS Poster


Poster given at TG11, Salt Lake City, July 2011, by Xuan Wu, Raminder Singh, Suresh Marru, Suresh Kumar Deivasigamani, and Marlon Pierce

Published in: Technology

  • The framework provides application registry capabilities for registering the resources and applications used by a gateway. Application performance models can be plugged in to update performance data for a specific host. Once registered, the gateway can query for real-time status information; the framework determines status by checking that the required File Transfer and Job Management interfaces are healthy. As a first pass, resources in maintenance, resources with faulty job managers, and resources with overwhelmed GridFTP servers are eliminated from scheduling. Beyond that, combining the job queue and file transfer information from Karnak and Speedpage increases gateway job success rates and improves turnaround times.
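The first-pass filtering described in the note can be illustrated with a minimal Python sketch. It is not the ORPS implementation (which the poster describes as a Java/Spring MVC service); the `Resource` fields, the function names, and the precedence of maintenance over BAD over UNKNOWN are all assumptions made for illustration. The status names follow the poster.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    GOOD = "GOOD"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"
    SCHEDULED_MAINTENANCE = "SCHEDULED MAINTENANCE"


@dataclass
class Resource:
    # Hypothetical record; field names are not taken from ORPS.
    name: str
    in_maintenance: bool
    job_management: Status
    file_transfer: Status


def overall_health(r: Resource) -> Status:
    """Collapse per-service test results into one status.

    Assumed precedence: maintenance, then BAD, then UNKNOWN, then GOOD.
    """
    if r.in_maintenance:
        return Status.SCHEDULED_MAINTENANCE
    checks = [r.job_management, r.file_transfer]
    if Status.BAD in checks:
        return Status.BAD
    if Status.UNKNOWN in checks:
        return Status.UNKNOWN
    return Status.GOOD


def schedulable(resources):
    """Keep only resources whose overall health is GOOD."""
    return [r for r in resources if overall_health(r) is Status.GOOD]
```

A gateway would then pass only the resources returned by `schedulable` on to queue-wait and file-transfer ranking.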

Gateway Optimal Resource Selection and Integrated Information Services Framework

Xuan Wu, Deivasigamani Suresh Kumar, Raminder Singh, Suresh Marru, Marlon Pierce
Pervasive Technology Institute, Indiana University

Goals

Various TeraGrid information and monitoring services provide valuable information, but it is scattered among TeraGrid Information Services (TGIS), the TeraGrid news maintenance feed, INCA resource monitoring, Karnak system status and start time and wait time prediction, and Speedpage file transfer monitoring and estimates. TGIS provides a one-stop shop for these services, but the data sets remain discrete. Gateways would like to get the information they need with a single query. Motivated by the needs of the LEAD, UltraScan and GridChem gateways, we developed the Optimal Resource Prediction Service (ORPS). The goals of ORPS are:
• Integrate all information sources into a single yes-or-no answer to the question "Is the resource healthy?" The check should verify that the machine is not in maintenance and that at least one file transfer service and one job management service are up and functional.
• Predict which resources will be optimal for running the next compute job, based on Karnak start time prediction, pre-determined application performance and estimated run time.

Usage Scenario

One-Time Registration:
• Step A: Register the community and research applications used by the gateway, along with any available performance data.
• Step B: Register all gateway resources and the grid middleware they use, such as GRAM5 and GridFTP.

Example Queries:
1. Given a job configuration (number of processors, wall time), return all healthy resources sorted by earliest start time.
2. Gateway-Specific Resource Summary: list the status of all resources used by a gateway (UltraScan, GridChem), including:
• Status of Job Management, File Transfer, GSISSH and login nodes.
• Overall health: GOOD if all three of the above services are healthy; BAD if any one of them is down; UNKNOWN if any test result returns unknown; SCHEDULED MAINTENANCE.
• Current load: number of waiting and running jobs, and usage.

Usage

Hosted Service or Download & Deploy

Hosted Service Example:
service/XML/gateway/$(gatewayId)

Download, build and deploy:
1. svn co
2. mvn clean install
3. Configure data collection schedules, database for caching, ports
4.

Architecture

(Architecture diagram not reproduced.)

Salient Features
• ORPS is a flexible and extensible architecture developed in Java on the Spring MVC framework. The framework adapts to emerging information sources.
• External information services send information through subscriptions (push) or are queried by periodic polls.
• The scheduler polls each source according to its update frequency, and data is provided downstream in near real time.
• ORPS exposes both the raw and the mashed-up information to gateways through REST interfaces.
• Information sources update schedule-aware, multi-level database caches that serve surges of job submission requests from gateways.
• The determined health and schedule are cached in the second level to ensure a quick response time (< 100 ms).
• Detailed test failures are provided to help distinguish transient from persistent failures.

Status & Future Work

Phase I (completed):
• ORPS is currently integrated into the UltraScan production gateway.
• Working with GridChem gateway developers to integrate ORPS into their development environment.

Phase II (in development):
• Application-specific scheduling: get all healthy resources that can run Gaussian on TeraGrid. Selection is based on queue wait time + Gaussian relative performance data + bandwidth estimates.

Acknowledgement

The authors would like to thank the INCA, TGIS and Karnak teams for valuable discussions and support, and the UltraScan and GridChem gateways for requirements and integration. This work is partially supported by the TeraGrid Gateway Advanced Support Activity and the Open Gateway Computing Environments NSF SDCI Grant No. OCI-1032742.
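Example query 1 and the Phase II selection criteria from the poster can be sketched in a few lines of Python. This is an illustration only, not the ORPS code: the `Candidate` fields, the function names, and in particular the additive turnaround estimate (wait + scaled run time + transfer time) are assumptions. The poster lists the three inputs but does not say how they are combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record; none of these names come from ORPS.
    name: str
    healthy: bool
    predicted_wait_s: float   # e.g. from a start-time prediction service
    relative_speed: float     # 1.0 = reference host; higher is faster
    bandwidth_mb_s: float     # estimated transfer rate to the resource


def by_start_time(candidates):
    """Example query 1: healthy resources, earliest predicted start first."""
    healthy = [c for c in candidates if c.healthy]
    return sorted(healthy, key=lambda c: c.predicted_wait_s)


def estimated_turnaround_s(c, reference_runtime_s, input_mb):
    """Assumed model: queue wait + scaled run time + input transfer time."""
    run_s = reference_runtime_s / c.relative_speed
    transfer_s = input_mb / c.bandwidth_mb_s
    return c.predicted_wait_s + run_s + transfer_s


def by_turnaround(candidates, reference_runtime_s, input_mb):
    """Phase II style ranking over healthy resources only."""
    healthy = [c for c in candidates if c.healthy]
    return sorted(
        healthy,
        key=lambda c: estimated_turnaround_s(c, reference_runtime_s, input_mb),
    )
```

The two rankings can disagree: a resource with a short queue but slow relative performance may start a job first yet finish it last, which is the case application-specific scheduling is meant to catch.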