Hasthi Lead Integration: A Case Study on System Management


  1. Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project. Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale. Indiana University, Bloomington
  2. SOA => Many-Service Systems <ul><li>SOA leads to systems made up of many services </li></ul><ul><li>Good: they are distributed, loosely coupled, etc., but </li></ul><ul><li>Bad: they are not easy to manage, especially when distributed across many machines </li></ul><ul><li>Ugly: a system management/administration nightmare </li></ul><ul><li>With so many service systems, most of them reasonably large in scale, systems management is as important as ever! </li></ul>
  3. I Have a System Management Framework; Am I Done?
  4. Applying System Management Is Not Simple (Some Problems) <ul><li>Building a generic framework for actions and monitoring agents. </li></ul><ul><li>Identifying and formulating management scenarios for a given system. </li></ul><ul><li>Handling state lost in failed managed services, and lost messages. </li></ul><ul><li>Handling failed management actions, and avoiding loops when a management action fails. </li></ul><ul><li>Notifying other services when a service's location changes after recovery. </li></ul>
  5. Case Study Based on a Large Scale E-Science Project <ul><li>Enables scientists to find interesting conditions in weather data collected across the United States, process them using national computational resources (TeraGrid), and manage weather data, results, and their provenance </li></ul><ul><li>Built using an SOA-based architecture; has 13+ persistent services and many services created on demand. </li></ul>
  6. Hasthi Management Framework <ul><li>Enforces user-defined management logic (expressed as rules), and has a global view of the system. </li></ul><ul><li>Scalable (to manage about 100,1000 services). </li></ul><ul><li>Robust (self-organizing; recovers from failures of both resources and the management framework) </li></ul><ul><li>Dynamic (discovers components, keeps track of resources as they join and leave) </li></ul>
  7. Proposed Integration Model of Hasthi with a Given System
  8. Types of Management Agents
  9. Management Actions <ul><li>Action types </li></ul><ul><ul><li>Create a new service </li></ul></ul><ul><ul><li>Restart a running service or recover a failed service </li></ul></ul><ul><ul><li>Relocate a service </li></ul></ul><ul><ul><li>Tune and configure a resource: change the configuration of a resource or change the structure of the system. </li></ul></ul><ul><ul><li>User interaction action </li></ul></ul><ul><li>Action implementation: </li></ul><ul><ul><li>Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running on each host. </li></ul></ul><ul><ul><li>Use a Hasthi Agent integrated with each resource. </li></ul></ul><ul><ul><li>Hasthi provides default management actions, but users can write their own. </li></ul></ul>
  10. Handling Lost State <ul><li>If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service. </li></ul><ul><li>Hasthi acts as a service registry and helps services find instances of the services they depend on through a lookup. Services can therefore locate their dependencies via the lookup, both at initialization and after a dependency has failed. </li></ul>
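The two mechanisms on this slide, passing a state location to a replacement service and finding services through a lookup, can be sketched roughly as below. This is an illustration only, not Hasthi's actual API: `ServiceRegistry`, `ServiceRecord`, the `restart` method, the `state_location` field, and the example endpoints are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ServiceRecord:
    name: str
    endpoint: str
    # Where the service persists its state (hypothetical field name).
    state_location: Optional[str] = None


class ServiceRegistry:
    """Toy registry mapping service names to their current instance."""

    def __init__(self) -> None:
        self._records: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._records[record.name] = record

    def lookup(self, name: str) -> Optional[ServiceRecord]:
        # Dependents call this at startup or after a dependency fails.
        return self._records.get(name)

    def restart(self, name: str, new_endpoint: str) -> ServiceRecord:
        """Replace a failed instance, handing its state location to the new one."""
        old = self._records[name]
        new = ServiceRecord(name, new_endpoint, old.state_location)
        self._records[name] = new
        return new


# Hypothetical service names and endpoints, for illustration only.
registry = ServiceRegistry()
registry.register(
    ServiceRecord("catalog", "http://host-a:8080/catalog", "/data/catalog.state")
)
replacement = registry.restart("catalog", "http://host-b:8080/catalog")
```

After the restart, a dependent service that looks up `catalog` gets the new endpoint, and the replacement instance still knows where the old state was written.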
  11. Failed Management Actions <ul><li>The resource life cycle avoids loops </li></ul><ul><li>User interactions delegate fixing the error to human users (an email is sent to the user, who responds by clicking a link) </li></ul>
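One way to picture loop avoidance combined with delegation to a human is a bounded retry followed by escalation. The slide does not say how Hasthi's resource life cycle actually prevents loops, so the attempt limit, the `State` values, and the `recover` function below are assumptions made for this sketch.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    HEALTHY = "healthy"
    NEEDS_HUMAN = "needs_human"


def recover(
    action: Callable[[], bool],
    notify_user: Callable[[str], None],
    max_attempts: int = 3,
) -> State:
    """Run a management action a bounded number of times, then escalate."""
    for _ in range(max_attempts):
        if action():
            return State.HEALTHY
    # Stop retrying so a permanently failing action cannot loop forever;
    # hand the problem to a human instead (e.g. by email).
    notify_user(f"recovery failed after {max_attempts} attempts")
    return State.NEEDS_HUMAN


# Stand-in for sending an email: collect the messages in a list.
messages = []
failed_result = recover(lambda: False, messages.append, max_attempts=2)
ok_result = recover(lambda: True, messages.append)
```

Here the always-failing action is tried twice and then escalated with a single notification, while the succeeding action returns immediately without notifying anyone.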
  12. False Positives <ul><li>A very hard problem, and a fact of life for systems. </li></ul><ul><li>We use heartbeats and timeouts as indicators, which trigger (pluggable) failure detectors (e.g. active pings, functional tests). </li></ul><ul><li>Timeouts reported by other services can also raise a suspected-fault condition, which activates custom failure detectors. </li></ul>
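The two-stage idea on this slide, where a missed heartbeat only raises suspicion and pluggable detectors then confirm or reject it, might look like the following sketch. The `HeartbeatMonitor` class, the `Detector` signature, and the choice to require every detector to agree are assumptions for illustration, not Hasthi's implementation.

```python
import time
from typing import Callable, Dict, List, Optional

# A detector returns True when it confirms that the service has failed.
Detector = Callable[[str], bool]


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, detectors: List[Detector]) -> None:
        self.timeout_s = timeout_s
        self.detectors = detectors
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, service: str, now: Optional[float] = None) -> None:
        self.last_seen[service] = time.monotonic() if now is None else now

    def suspects(self, now: Optional[float] = None) -> List[str]:
        """Services whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return [s for s, t in self.last_seen.items() if now - t > self.timeout_s]

    def confirmed_failures(self, now: Optional[float] = None) -> List[str]:
        # A timeout is only a hint. A suspect counts as failed only if every
        # detector agrees (with no detectors, every suspect is confirmed).
        return [s for s in self.suspects(now) if all(d(s) for d in self.detectors)]


def fake_ping(service: str) -> bool:
    # Stand-in for an active ping: only "workflow-engine" is really down.
    return service == "workflow-engine"


monitor = HeartbeatMonitor(timeout_s=30.0, detectors=[fake_ping])
monitor.heartbeat("workflow-engine", now=0.0)
monitor.heartbeat("data-catalog", now=0.0)
suspected = monitor.suspects(now=60.0)
failed = monitor.confirmed_failures(now=60.0)
```

Both services miss their heartbeat here, but only the one the detector confirms is reported as failed, which is how the second stage filters out false positives.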
  13. LEAD E-Science Project <ul><li>We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period: 30 of 80 (37%) distinct error types were responsible for 95% of all error occurrences. </li></ul><ul><li>LEAD services write data to a database at once, and LEAD has a best-effort global state (explain). </li></ul><ul><li>Handling errors in LEAD </li></ul><ul><ul><li>Execution errors: handled by multiple levels of retries (e.g. file transfer and job submission retries, running executions on different computational resources; part of LEAD). </li></ul></ul><ul><ul><li>Hasthi handles infrastructure errors, and then recovers the workflows that failed due to those errors. </li></ul></ul>
  14. Use Cases as Rules <ul><li>A rule is a condition and an action. </li></ul><ul><li>Example: recover failed services by restarting or moving them (real rules can be more complicated) </li></ul>
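A rule as a condition paired with an action can be sketched in a few lines. The slides do not show Hasthi's real rule syntax, so the `Rule` dataclass, the `evaluate` loop, and the dictionary-based service model here are hypothetical and serve only to illustrate the condition-and-action shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Service = Dict[str, str]


@dataclass
class Rule:
    name: str
    condition: Callable[[Service], bool]
    action: Callable[[Service], None]


def evaluate(rules: List[Rule], services: List[Service]) -> List[str]:
    """Run each rule's action on every service that matches its condition."""
    fired = []
    for service in services:
        for rule in rules:
            if rule.condition(service):
                rule.action(service)
                fired.append(f"{rule.name}:{service['name']}")
    return fired


def restart(service: Service) -> None:
    # Stand-in for a real restart action.
    service["status"] = "restarting"


rules = [Rule("restart-failed", lambda s: s["status"] == "failed", restart)]
services = [
    {"name": "catalog", "status": "failed"},
    {"name": "broker", "status": "healthy"},
]
fired = evaluate(rules, services)
```

Only the failed service matches the condition, so the restart action fires once and the healthy service is left untouched.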
  15. Rules: Detect a Failed System, and Restart Workflows after Failure
  16. Workflow Recovery
  17. Evaluation: LEAD Integration <ul><li>Hasthi recovers LEAD from service and host failures and recovers failed workflows. </li></ul><ul><li>We (A) killed a service and (B) killed a host, and measured the time to detect the failure, trigger actions, have new resources join, and detect healthy conditions. It takes about 2 minutes to recover the system and to know it is healthy. </li></ul>
  18. What Do the Results Mean? <ul><li>Assume the MTTF of a service is f, and services are independent. Then the MTTF of the system is f/26 (by Baumann [8], assuming 26 services). </li></ul><ul><li>Using the MTTR from the above results, and assuming Hasthi does not fail, the availability of the system is MTTF/(MTTF + MTTR), taken with the system MTTF of f/26. </li></ul><ul><li>That is an availability of 0.995, 0.997, and 0.999 with an MTTF of 1 week, 2 weeks, and 1 month per service, which is 46.8, 26.3, and 8.8 hours of downtime per year. </li></ul>
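The availability figures can be checked with a short calculation. This sketch uses the standard definition, availability = MTTF / (MTTF + MTTR), with a system MTTF of f/26. It assumes an MTTR of exactly 2 minutes (the slides say "about 2 minutes") and a 30-day month; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24


def system_availability(
    service_mttf_min: float, mttr_min: float, n_services: int = 26
) -> float:
    """Availability of a system of independent services with equal MTTF."""
    system_mttf = service_mttf_min / n_services
    return system_mttf / (system_mttf + mttr_min)


def downtime_hours_per_year(availability: float) -> float:
    return (1.0 - availability) * HOURS_PER_YEAR


WEEK_MIN = 7 * 24 * 60
MONTH_MIN = 30 * 24 * 60  # assumption: a 30-day month
MTTR_MIN = 2.0            # assumption: "about 2 minutes" taken as exactly 2

availabilities = [
    round(system_availability(f, MTTR_MIN), 3)
    for f in (WEEK_MIN, 2 * WEEK_MIN, MONTH_MIN)
]
print(availabilities)
```

Under these assumptions the rounded availabilities match the 0.995, 0.997, and 0.999 quoted on the slide. Because the exact MTTR is not given, `downtime_hours_per_year` only approximates the downtime hours quoted there rather than reproducing them.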
  19. Demo (If We Have Time) <ul><li>http://www.extreme.indiana.edu/hasthi/lead/screencasts/hasthi4.htm </li></ul>
  20. Questions
