Hasthi Lead Integration: A Case Study on System Management
Presentation Transcript
Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project
Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale
Indiana University, Bloomington
SOA => Many Service Systems
SOA leads to many service systems.
Good: it is distributed, loosely coupled, etc., but
Bad: not very easy to manage, especially when it is distributed across many machines
Ugly: a system management/administration nightmare
So with many service systems, most of them reasonably large scale, systems management has become more important than ever!
I have a System Management framework, am I Done?
Application of System Management is not Simple (some problems).
Building a generic framework for actions and monitoring agents.
Identifying/ formulating management scenarios given a system.
Handling the lost state of failed managed services; what about lost messages?
Handling failed management actions, and avoiding loops when a management action has failed.
Notifying other services if a service location has changed after recovery.
Case Study Based on Large Scale E-Science Project
Enables scientists to find interesting conditions in weather data collected across the United States, process them using national computation resources (TeraGrid), and manage the weather data, results, and their provenance.
Built using an SOA-based architecture; has 13+ persistent services and many services created on demand.
Enforces user-defined management logic (expressed as rules), and has a global view of the system.
Scalable (to manage about 100,1000 services).
Robust (self-organizing; recovers from failures of both resources and the management framework)
Dynamic (discover components, keep track when resources join and leave)
Hasthi Management Framework
Proposed Integration Model of Hasthi with a Given System
Types of Management Agents
Create a New service
Restart a running service or recover a failed service
Relocate a service
Tune and configure a resource – change the configuration of a resource or change the structure of the system.
User Interaction Action
Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running on each host.
Use Hasthi Agent integrated with each resource.
Hasthi provides default management actions, but users can write their own.
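The shell-script option can be pictured as a small per-host agent that maps action names to commands and runs them locally. This is only a sketch: the `HostAgent` class, the action name `start`, and the stand-in command are hypothetical and are not Hasthi's actual API.

```python
import subprocess
import sys


class HostAgent:
    """Hypothetical per-host agent that runs registered commands on request."""

    def __init__(self):
        self._actions = {}  # action name -> argv list

    def register(self, name, argv):
        self._actions[name] = list(argv)

    def execute(self, name, timeout=30):
        """Run the command registered for `name`; return (succeeded, output)."""
        if name not in self._actions:
            return False, f"unknown action: {name}"
        try:
            done = subprocess.run(
                self._actions[name],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Missing executable or a hung script counts as a failed action.
            return False, str(exc)
        return done.returncode == 0, done.stdout + done.stderr


agent = HostAgent()
# Stand-in for a real script such as a service start command (assumption).
agent.register("start", [sys.executable, "-c", "print('service started')"])
ok, output = agent.execute("start")
```

Returning a success flag rather than raising lets the caller decide what to do when an action fails, which matters for the failed-action handling discussed later in the talk.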
Handling Lost State
If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service.
Hasthi acts as a service registry and helps services find instances of the services they depend on through a lookup. Services can therefore rediscover a dependency via the lookup, either at initialization or after the dependency has failed.
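The two ideas above (handing the saved state location to the replacement, and letting dependents re-resolve a service by lookup) can be sketched together. The `Registry` class, its method names, and the example URLs and path are illustrative assumptions, not Hasthi's interface.

```python
class Registry:
    """Hypothetical registry: service name -> current URL and state location."""

    def __init__(self):
        self._entries = {}

    def register(self, name, url, state_location=None):
        self._entries[name] = {"url": url, "state_location": state_location}

    def lookup(self, name):
        """Return the current URL of `name`, or None if it is unknown."""
        entry = self._entries.get(name)
        return entry["url"] if entry else None

    def recover(self, name, new_url, start_service):
        """Start a replacement, passing it the old instance's state location."""
        old = self._entries.get(name, {})
        state_location = old.get("state_location")
        start_service(new_url, state_location)
        self.register(name, new_url, state_location)


started = []
registry = Registry()
# Made-up service name, URLs, and state path for illustration only.
registry.register("catalog", "http://host-a:8080/catalog", "/data/catalog.state")
registry.recover(
    "catalog",
    "http://host-b:8080/catalog",
    lambda url, state: started.append((url, state)),
)
current = registry.lookup("catalog")  # dependents re-resolve after recovery
```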
Failed Management Actions
The resource life cycle avoids loops.
User interactions delegate fixing the error to human users (send an email to the user; the user responds by clicking a link).
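A minimal sketch of how a life cycle can prevent a failed repair from being retried forever: the repair only fires in the crashed state, and a failed repair moves the resource to a state that no automatic rule matches, handing it to a human instead. The state names and the `handle` function are assumptions for illustration, not Hasthi's actual life cycle.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    CRASHED = "crashed"
    REPAIRING = "repairing"
    NEEDS_HUMAN = "needs_human"


def handle(resource, repair, notify_user):
    """Attempt one repair for a crashed resource; escalate if it fails."""
    if resource["state"] is not State.CRASHED:
        # Only CRASHED triggers a repair, so a failed repair cannot loop.
        return resource["state"]
    resource["state"] = State.REPAIRING
    try:
        succeeded = bool(repair(resource))
    except Exception:
        succeeded = False
    if succeeded:
        resource["state"] = State.HEALTHY
    else:
        resource["state"] = State.NEEDS_HUMAN
        notify_user(resource["name"])  # e.g. send an email with a link
    return resource["state"]


attempts = []
notified = []


def failing_repair(resource):
    attempts.append(resource["name"])
    return False


service = {"name": "example-service", "state": State.CRASHED}
first = handle(service, failing_repair, notified.append)
second = handle(service, failing_repair, notified.append)  # no second attempt
```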
Very hard problem; a fact of systems.
We use heartbeat + timeouts as indicators and trigger (pluggable) failure detectors (e.g. active pings, functional tests).
Timeouts observed by other services can also raise a suspected-fault condition, which activates the custom failure detectors.
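The heartbeat-plus-detector idea can be sketched as a two-stage check: a missed heartbeat only makes a resource a suspect, and pluggable detectors then confirm or clear the suspicion. The `FailureMonitor` class, the detector signature (a callable that returns True when the resource responds), and the resource names are assumptions, not Hasthi's implementation.

```python
class FailureMonitor:
    """Hypothetical monitor: heartbeats raise suspicion, detectors confirm."""

    def __init__(self, timeout, detectors):
        self.timeout = timeout      # seconds without a heartbeat before suspicion
        self.detectors = detectors  # callables: name -> True if resource is alive
        self.last_seen = {}

    def heartbeat(self, name, now):
        self.last_seen[name] = now

    def check(self, now):
        """Return resources that are overdue and that no detector finds alive."""
        failed = []
        for name, seen in self.last_seen.items():
            if now - seen <= self.timeout:
                continue  # heartbeat is recent enough, not a suspect
            if not any(detector(name) for detector in self.detectors):
                failed.append(name)
        return failed


def active_ping(name):
    # Stand-in for a real ping or functional test: only "b" still answers.
    return name == "b"


monitor = FailureMonitor(timeout=10, detectors=[active_ping])
monitor.heartbeat("a", now=0)
monitor.heartbeat("b", now=0)
monitor.heartbeat("c", now=95)
failed = monitor.check(now=100)
```

Here `a` and `b` have both missed heartbeats, but only `a` is reported, because the ping still reaches `b`. Time is passed in explicitly so the behavior is easy to reason about and test.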
LEAD E-Science Project
We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period, in which 30 of 80 (37%) different error types were responsible for 95% of all error occurrences.
LEAD services write data to a database at once, and LEAD has a best-effort global state (explain).
Handling Errors in LEAD
Execution errors: handled by multiple levels of retries (e.g. file transfer and job submission retries, running executions on different computational resources; part of LEAD).
Hasthi handles infrastructure errors, and then recovers the workflows that failed due to those errors.
Use Case as Rules
A condition and an action.
Recover failed services by restarting or moving them (real rules can be complicated).
Rules: Detect Failed System, and Restart Workflows after failure.
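A rule in this sense pairs a condition with an action and is evaluated against the manager's view of the system. The sketch below illustrates only that shape. The `Rule` class, the resource dictionaries, and the two example rules are assumptions, and the rule language Hasthi actually uses is not shown in these slides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], None]


def evaluate(rules, resources):
    """Run every rule whose condition matches; return (rule, resource) pairs."""
    fired = []
    for resource in resources:
        for rule in rules:
            if rule.condition(resource):
                rule.action(resource)
                fired.append((rule.name, resource["name"]))
    return fired


def mark(new_state):
    """Build an action that just records the intended state change."""
    def action(resource):
        resource["state"] = new_state
    return action


rules = [
    Rule(
        "restart-failed-service",
        lambda r: r["kind"] == "service" and r["state"] == "failed",
        mark("restarting"),
    ),
    Rule(
        "resubmit-failed-workflow",
        lambda r: r["kind"] == "workflow" and r["state"] == "failed",
        mark("resubmitted"),
    ),
]

system = [
    {"name": "svc-1", "kind": "service", "state": "failed"},
    {"name": "svc-2", "kind": "service", "state": "healthy"},
    {"name": "wf-7", "kind": "workflow", "state": "failed"},
]
fired = evaluate(rules, system)
```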
Hasthi recovers LEAD from service and host failures and recovers failed workflows.
We (A) killed a service and (B) killed a host, and measured the time to detect the failure, trigger actions, have new resources join, and detect a healthy condition. It takes about 2 minutes to recover the system and to know it is healthy.
Evaluation: LEAD Integration
What Do the Results Mean?
Assume the MTTF of a service is f, and services fail independently. Then the MTTF of the system is f/26 (by Baumann; assuming 26 services).
Using the MTTR from the results above, and assuming Hasthi does not fail, the availability of the system is MTTF_system / (MTTF_system + MTTR), where MTTF_system = f/26.
That is an availability of 0.995, 0.997, and 0.999 with a per-service MTTF of 1 week, 2 weeks, and 1 month, which corresponds to 46.8, 26.3, and 8.8 hours of downtime per year.
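The availability figures can be reproduced with a short calculation, assuming the standard steady-state definition availability = MTTF / (MTTF + MTTR), a system MTTF of f/26, an MTTR of 2 minutes (the recovery time reported earlier), a 30-day month, and a 365-day year. These parameter choices and the function names are assumptions made here to illustrate the arithmetic.

```python
HOURS_PER_YEAR = 24 * 365


def availability(service_mttf_hours, services=26, mttr_minutes=2.0):
    """Steady-state availability of a system of independent services."""
    system_mttf = service_mttf_hours / services  # f / 26
    mttr_hours = mttr_minutes / 60.0
    return system_mttf / (system_mttf + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected yearly downtime implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR


results = {}
for label, mttf_hours in [("1 week", 7 * 24), ("2 weeks", 14 * 24), ("1 month", 30 * 24)]:
    rounded = round(availability(mttf_hours), 3)
    results[label] = (rounded, round(downtime_hours_per_year(rounded), 1))
```

Under these assumptions the rounded availabilities come out as 0.995, 0.997, and 0.999. Converting those rounded values to yearly downtime gives roughly 43.8, 26.3, and 8.8 hours; the last two agree with the figures quoted above, while the first is somewhat lower than 46.8, a difference the slides do not explain.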