Hasthi Lead Integration: A Case Study on System Management


    1. Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project. Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale (Indiana University, Bloomington)
    2. SOA => Many Service Systems
      • SOA leads to many service systems
      • Good: they are distributed, loosely coupled, etc., but
      • Bad: not very easy to manage, especially when distributed across many machines
      • Ugly: a system management/administration nightmare
      • So with many service systems, most of them reasonably large scale, systems management has become more important than ever!
    3. I have a System Management framework, am I Done?
    4. Application of System Management is not Simple (some problems).
      • Building a generic framework for actions and monitoring agents.
      • Identifying/ formulating management scenarios given a system.
      • Handling the state lost when a managed service fails. What about lost messages?
      • Handling failed management actions, and avoiding loops when an action keeps failing.
      • Notifying other services if a service location has changed after recovery.
    5. Case Study Based on Large Scale E-Science Project
      • Enables scientists to find interesting conditions in weather data collected across the United States, process the data using national computational resources (TeraGrid), and manage the weather data, results, and their provenance.
      • Built using an SOA-based architecture; has 13+ persistent services and many services created on demand.
      • Hasthi enforces user-defined management logic (expressed as rules) and has a global view of the system.
      • Scalable (to manage about 100,1000 services).
      • Robust (self-organizing; recovers from failures of both resources and the management framework)
      • Dynamic (discovers components, keeps track as resources join and leave)
      Hasthi Management Framework
    6. Proposed Integration Model of Hasthi with a Given System
    7. Types of Management Agents
      • Action Types
        • Create a New service
        • Restart a running service or recover a failed service
        • Relocate a service
        • Tune and configure a resource – change the configuration of a resource or change the structure of the system.
        • User Interaction Action
      • Actions implementation:
        • Use shell scripts (e.g. service start or stop) and execute them using a Host Agent running in each host.
        • Use Hasthi Agent integrated with each resource.
        • Hasthi provides default management actions, but users can write their own.
      Management Actions
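The shell-script style of action described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Hasthi's actual agent: `HostAgent`, `ActionResult`, and the stand-in commands are hypothetical names introduced here.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ActionResult:
    """Outcome of one management action (hypothetical structure)."""
    succeeded: bool
    output: str


class HostAgent:
    """Hypothetical per-host agent that runs management actions as commands."""

    def __init__(self, timeout_seconds: float = 30.0) -> None:
        self.timeout_seconds = timeout_seconds

    def execute(self, command: List[str]) -> ActionResult:
        # Run the command and capture its output so the management
        # framework can decide whether the action worked.
        try:
            completed = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=self.timeout_seconds,
            )
        except (OSError, subprocess.TimeoutExpired) as error:
            return ActionResult(False, str(error))
        return ActionResult(completed.returncode == 0, completed.stdout.strip())


agent = HostAgent()
# Stand-ins for real start/stop scripts: tiny Python commands.
ok = agent.execute([sys.executable, "-c", "print('service started')"])
failed = agent.execute([sys.executable, "-c", "import sys; sys.exit(3)"])
```

Returning a result object instead of raising lets the caller treat a failed action as ordinary data, which matters once failed actions themselves have to be managed.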
    8. Handling Lost State
      • If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service.
      • Hasthi acts as a service registry and helps services find instances of the services they depend on through a lookup. Services can therefore rediscover a dependency via the lookup, either at initialization or after the dependency has failed.
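A minimal sketch of these two ideas, assuming a simple in-memory registry. `Registry`, `ServiceRecord`, the endpoints, and the state path are hypothetical and do not reflect Hasthi's real interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ServiceRecord:
    name: str
    endpoint: str
    state_location: Optional[str] = None  # where the service persists its state


class Registry:
    """Hypothetical registry mapping service names to their current instance."""

    def __init__(self) -> None:
        self._records: Dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        self._records[record.name] = record

    def lookup(self, name: str) -> Optional[str]:
        # Dependent services call this at startup or after a dependency fails.
        record = self._records.get(name)
        return record.endpoint if record else None

    def recover(self, name: str, new_endpoint: str) -> ServiceRecord:
        """Register a replacement instance, carrying over the state location."""
        old = self._records[name]
        replacement = ServiceRecord(name, new_endpoint, old.state_location)
        self._records[name] = replacement
        return replacement


registry = Registry()
registry.register(
    ServiceRecord("catalog", "http://host-a:8080/catalog", "/data/catalog-state")
)
replacement = registry.recover("catalog", "http://host-b:8080/catalog")
```

After `recover`, a dependent service that repeats its lookup gets the new endpoint, and the replacement instance knows where the old instance left its state.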
    9. Failed Management Actions
      • The resource life cycle avoids loops
      • User interactions delegate fixing the error to human users (an email is sent to the user, who responds by clicking a link)
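One way a life cycle can prevent repair loops is to bound the number of automatic attempts and then move the resource into a state that only a human can clear. The sketch below assumes that design; the states, `ManagedResource`, and the `max_attempts` limit are hypothetical and are not taken from Hasthi.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    REPAIRING = "repairing"
    NEEDS_HUMAN = "needs_human"


class ManagedResource:
    """Hypothetical resource whose life cycle bounds automatic repairs."""

    def __init__(self, name, max_attempts=2):
        self.name = name
        self.max_attempts = max_attempts
        self.attempts = 0
        self.state = State.HEALTHY

    def on_failure(self, repair, notify_user):
        """Try a bounded number of repairs, then hand off to a human."""
        if self.state is State.NEEDS_HUMAN:
            return self.state  # already escalated, so do not retry again
        self.state = State.REPAIRING
        while self.attempts < self.max_attempts:
            self.attempts += 1
            if repair(self.name):
                self.attempts = 0
                self.state = State.HEALTHY
                return self.state
        self.state = State.NEEDS_HUMAN
        notify_user(f"{self.name} could not be repaired automatically")
        return self.state


notifications = []
resource = ManagedResource("workflow-engine")
# A repair action that always fails, to show the escalation path.
final_state = resource.on_failure(lambda name: False, notifications.append)
again = resource.on_failure(lambda name: False, notifications.append)
```

The second call returns immediately without retrying or sending another notification, which is the loop-avoidance property.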
    10. False Positives
      • A very hard problem, and a fact of life in such systems.
      • We use heartbeats and timeouts as indicators, which trigger (pluggable) failure detectors (e.g. active pings, functional tests).
      • Timeouts observed by other services can also raise a fault-suspected condition, which activates the custom failure detectors.
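The two-stage approach (a missed heartbeat raises suspicion, then pluggable detectors confirm or clear it) might look like the sketch below. `HeartbeatMonitor`, the detector signature, and the timeout value are assumptions made for illustration only.

```python
from typing import Callable, Dict, List

# A detector returns True if it can still reach the service (e.g. via a ping).
Detector = Callable[[str], bool]


class HeartbeatMonitor:
    """Hypothetical monitor: timeouts raise suspicion, detectors confirm it."""

    def __init__(self, timeout: float, detectors: List[Detector]) -> None:
        self.timeout = timeout
        self.detectors = detectors
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, service: str, now: float) -> None:
        self.last_seen[service] = now

    def suspects(self, now: float) -> List[str]:
        # Services whose last heartbeat is older than the timeout.
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]

    def confirmed_failures(self, now: float) -> List[str]:
        # A suspect is declared failed only if no detector reports it alive.
        return [
            s for s in self.suspects(now)
            if not any(detector(s) for detector in self.detectors)
        ]


reachable = {"slow-service"}  # pretend an active ping still reaches this one
monitor = HeartbeatMonitor(timeout=30.0, detectors=[lambda s: s in reachable])
monitor.heartbeat("slow-service", now=0.0)
monitor.heartbeat("dead-service", now=0.0)
monitor.heartbeat("fresh-service", now=50.0)
suspected = monitor.suspects(now=60.0)
failed = monitor.confirmed_failures(now=60.0)
```

Here the slow service is suspected but cleared by the detector, so only the unreachable service is reported as failed.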
    11. LEAD E-Science Project
      • We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period, where 30/80 (37%) of the different error types were responsible for 95% of all error occurrences.
      • LEAD services write data to a database at once, and LEAD has a best-effort global state (explain).
      • Handling Errors in LEAD
        • Execution errors: handled by multiple levels of retries (e.g. file transfer and job submission retries, or rerunning executions on different computational resources; this is part of LEAD).
        • Hasthi handles infrastructure errors, and then recovers the workflows that failed because of those errors.
    12. Use Cases as Rules
      • A rule is a condition and an action.
      • Failed services are recovered by restarting or moving them (real rules can be more complicated).
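A condition/action rule of this kind can be sketched in a few lines. This is not Hasthi's rule language; `Rule`, `evaluate`, and the status strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SystemView = Dict[str, str]  # service name -> status, e.g. "up" or "failed"


@dataclass
class Rule:
    name: str
    condition: Callable[[str, str], bool]  # (service, status) -> should fire?
    action: Callable[[str], str]           # service -> description of action


def evaluate(rules: List[Rule], view: SystemView) -> List[str]:
    """Fire each rule for every service whose status matches its condition."""
    performed = []
    for service, status in sorted(view.items()):
        for rule in rules:
            if rule.condition(service, status):
                performed.append(rule.action(service))
    return performed


restart_failed = Rule(
    name="restart-failed-service",
    condition=lambda service, status: status == "failed",
    action=lambda service: f"restart {service}",
)

view = {"workflow-engine": "failed", "data-catalog": "up"}
actions = evaluate([restart_failed], view)
```

Because the rules are evaluated against a single view of all services, a rule can in principle inspect more than one service at a time, which is where a global view of the system becomes useful.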
    13. Rules: Detect a Failed System and Restart Workflows after a Failure
    14. Workflow Recovery
      • Hasthi recovers LEAD from service and host failures and recovers failed workflows.
      • We (a) killed a service and (b) killed a host, and measured the time needed to detect the failure, trigger actions, have new resources join, and detect a healthy condition. It takes about 2 minutes to recover the system and to know it is healthy.
      Evaluation: LEAD Integration
    15. What Do the Results Mean?
      • Assume the MTTF of a service is f and that services fail independently. Then the MTTF of the system is f/26 (following Baumann [8], and assuming 26 services).
      • Using the MTTR from the results above, and assuming Hasthi does not fail, the availability of the system is MTTF / (MTTF + MTTR) = (f/26) / (f/26 + MTTR).
      • That is, an availability of 0.995, 0.997, and 0.999 for a per-service MTTF of 1 week, 2 weeks, and 1 month, which corresponds to 46.8, 26.3, and 8.8 hours of downtime per year.
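The availability figures can be checked with a short calculation. The sketch assumes the standard steady-state formula, availability = MTTF / (MTTF + MTTR), an MTTR of about 2 minutes taken from the evaluation, and a month of 30 days; the function names are introduced here for illustration.

```python
SERVICE_COUNT = 26          # number of services assumed on the slide
MTTR_HOURS = 2 / 60         # about 2 minutes, expressed in hours


def system_mttf(service_mttf_hours: float, service_count: int) -> float:
    """MTTF of a system of independent services that share the same MTTF."""
    return service_mttf_hours / service_count


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Per-service MTTF values in hours: 1 week, 2 weeks, 1 month (30 days assumed).
per_service_mttf = {"1 week": 7 * 24, "2 weeks": 14 * 24, "1 month": 30 * 24}

results = {
    label: round(availability(system_mttf(hours, SERVICE_COUNT), MTTR_HOURS), 3)
    for label, hours in per_service_mttf.items()
}
print(results)
```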
    16. Demo (If we have time)
      • http://www.extreme.indiana.edu/hasthi/lead/screencasts/hasthi4.htm
    17. Questions
