Hasthi Lead Integration: A Case Study on System Management


    • Application of Management Frameworks to Manage Workflow-based Systems: A Case Study on a Large Scale E-Science Project. Srinath Perera, Suresh Marru, Thilina Gunarathne, Dennis Gannon, Beth Plale (Indiana University, Bloomington)
    • SOA => Many Service Systems
      • SOA leads to many service systems.
      • Good: they are distributed, loosely coupled, etc., but
      • Bad: they are not easy to manage, especially when distributed across many machines.
      • Ugly: a system management/administration nightmare.
      • So with many service systems, most of them reasonably large scale, systems management is as important as ever!
    • I have a System Management framework, am I done?
    • Applying System Management is not simple (some problems).
      • Building a generic framework for actions and monitoring agents.
      • Identifying and formulating management scenarios for a given system.
      • Handling the state lost in failed managed services. What about lost messages?
      • Handling failed management actions, and avoiding loops when a management action fails.
      • Notifying other services if a service's location has changed after recovery.
    • Case Study Based on a Large Scale E-Science Project
      • Enables scientists to find interesting conditions in weather data collected across the United States, process the data using national computational resources (TeraGrid), and manage weather data, results, and their provenance.
      • Built using an SOA-based architecture; has 13+ persistent services and many services created on demand.
    • Hasthi Management Framework
      • Enforces user-defined management logic (expressed as rules), and has a global view of the system.
      • Scalable (to manage about 100,1000 services).
      • Robust (self-organizing; recovers from failures of both resources and the management framework).
      • Dynamic (discovers components; keeps track of resources as they join and leave).
    • Proposed Integration Model of Hasthi with a Given System
    • Types of Management Agents
    • Management Actions
      • Action types:
        • Create a new service.
        • Restart a running service or recover a failed service.
        • Relocate a service.
        • Tune and configure a resource: change the configuration of a resource or change the structure of the system.
        • User interaction action.
      • Action implementation:
        • Use shell scripts (e.g. service start or stop), executed by a Host Agent running on each host.
        • Use a Hasthi Agent integrated with each resource.
        • Hasthi provides default management actions, but users can write their own.
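
The script-based action path described above can be sketched as follows. This is only an illustration under assumptions: `ActionResult` and `run_action` are hypothetical names, not part of Hasthi, and a small Python one-liner stands in for a real service start script so the sketch runs on any machine.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ActionResult:
    """Outcome of one management action (hypothetical type, not Hasthi's)."""
    ok: bool
    output: str


def run_action(command: List[str], timeout_s: float = 30.0) -> ActionResult:
    """Run a management command on this host and report success or failure."""
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing script and a hung script both count as a failed action.
        return ActionResult(ok=False, output=str(exc))
    return ActionResult(ok=(done.returncode == 0),
                        output=(done.stdout + done.stderr).strip())


if __name__ == "__main__":
    # Stand-in for a real "service start" script.
    print(run_action([sys.executable, "-c", "print('service started')"]))
```

Returning a result object rather than raising lets the caller decide what a failed action means, which matters once failed management actions have to be handled explicitly.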
    • Handling Lost State
      • If a service writes its state to a storage location and exposes that location as a parameter, Hasthi passes the location as an argument to the new service.
      • Hasthi acts as a service registry and helps services find instances of the services they depend on via a lookup. So a service can rediscover a dependency through the lookup, either at initialization or after the dependency has failed.
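
A minimal sketch of these two ideas, assuming an in-memory registry. `ServiceRecord`, `Registry`, and `recover` are hypothetical names chosen for illustration, and the URLs and paths in the usage note are made up; none of this is Hasthi's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ServiceRecord:
    name: str
    url: str
    state_location: Optional[str] = None  # where the service persists its state


@dataclass
class Registry:
    """Toy in-memory stand-in for the lookup role described above."""
    records: Dict[str, ServiceRecord] = field(default_factory=dict)

    def register(self, record: ServiceRecord) -> None:
        self.records[record.name] = record

    def lookup(self, name: str) -> str:
        """Return the current URL of a service; raises KeyError if unknown."""
        return self.records[name].url


def recover(registry: Registry, name: str, new_url: str) -> ServiceRecord:
    """Replace a failed instance, handing the old state location to the new one."""
    old = registry.records[name]
    replacement = ServiceRecord(name, new_url, old.state_location)
    registry.register(replacement)
    return replacement
```

For example, after `recover(reg, "catalog", "http://host-b:8080/catalog")`, a dependent service that calls `reg.lookup("catalog")` gets the new address, and the replacement still points at the state location the failed instance had exposed.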
    • Failed Management Actions
      • The resource life cycle avoids loops.
      • User interactions delegate fixing the error to human users (an email is sent to the user, and the user responds by clicking a link).
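
One way a life cycle can prevent repair loops is to bound the number of attempts and then hand the problem to a person. The sketch below assumes that approach; the states, the attempt budget, and the `ManagedResource` class are hypothetical and are not taken from Hasthi's actual life cycle.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    HEALTHY = "healthy"
    REPAIRING = "repairing"
    NEEDS_USER = "needs_user"


class ManagedResource:
    def __init__(self, name: str, max_attempts: int = 2) -> None:
        self.name = name
        self.max_attempts = max_attempts
        self.attempts = 0
        self.state = State.HEALTHY

    def on_failure(self, repair: Callable[[str], bool]) -> State:
        """Try a repair unless the attempt budget is spent, then escalate."""
        if self.state is State.NEEDS_USER:
            return self.state  # already escalated; never retry automatically
        if self.attempts >= self.max_attempts:
            # This is where a user-interaction action (e.g. an email) would go.
            self.state = State.NEEDS_USER
            return self.state
        self.attempts += 1
        self.state = State.REPAIRING
        if repair(self.name):
            self.attempts = 0
            self.state = State.HEALTHY
        return self.state
```

Because `NEEDS_USER` is terminal for automatic handling, a repair action that keeps failing is invoked at most `max_attempts` times.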
    • False Positives
      • A very hard problem; a fact of life in such systems.
      • We use heartbeats + timeouts as indicators, and these trigger (pluggable) failure detectors (e.g. active pings, functional tests).
      • Timeouts observed by other services can also raise a fault-suspected condition, which activates custom failure detectors.
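
The two-stage idea above (a cheap heartbeat timeout raises suspicion, then pluggable detectors confirm it) can be sketched like this. `HeartbeatMonitor` and the `Detector` signature are assumptions made for illustration; timestamps are passed in explicitly so the behavior is deterministic.

```python
from typing import Callable, Dict, List

# A detector returns True if the service still looks alive (e.g. a ping worked).
Detector = Callable[[str], bool]


class HeartbeatMonitor:
    def __init__(self, timeout_s: float, detectors: List[Detector]) -> None:
        self.timeout_s = timeout_s
        self.detectors = detectors
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, service: str, now: float) -> None:
        """Record the time of the latest heartbeat from a service."""
        self.last_seen[service] = now

    def suspects(self, now: float) -> List[str]:
        """Services whose last heartbeat is older than the timeout."""
        return [s for s, t in self.last_seen.items()
                if now - t > self.timeout_s]

    def failed(self, now: float) -> List[str]:
        """Suspects that no detector reports as alive."""
        return [s for s in self.suspects(now)
                if not any(detect(s) for detect in self.detectors)]
```

A service that is merely slow to send heartbeats shows up in `suspects` but not in `failed` as long as one detector can still reach it, which is how the second stage reduces false positives.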
    • LEAD E-Science Project
      • We confirmed the 80-20 rule by analyzing LEAD error data over an 18-month period, where 30 of 80 (37%) error types were responsible for 95% of all error occurrences.
      • LEAD services write data to a database immediately, and the system has a best-effort global state (explain).
      • Handling errors in LEAD
        • Execution errors: handled by multiple levels of retries (e.g. file transfer / job submission retries, running executions on different computational resources; part of LEAD).
        • Hasthi handles infrastructure errors, and then recovers workflows that failed due to those errors.
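
"Multiple levels of retries" can be read as an inner loop that repeats one operation and an outer loop that moves to a different resource. The sketch below assumes that reading; `retry` and `run_on_any` are hypothetical helpers and the resource names in the usage note are invented, not LEAD code.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry(op: Callable[[], T], attempts: int) -> T:
    """Inner level: repeat one operation up to a fixed number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return op()
        except Exception as exc:  # any failure counts as a failed attempt
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def run_on_any(job: Callable[[str], T], resources: Sequence[str],
               attempts_each: int = 2) -> T:
    """Outer level: move to the next resource once retries on one are spent."""
    for resource in resources:
        try:
            return retry(lambda: job(resource), attempts_each)
        except RuntimeError:
            continue
    raise RuntimeError("job failed on every resource")
```

With `resources=["cluster-a", "cluster-b"]` and a job that always fails on `cluster-a`, the job is tried twice there and then succeeds on its first attempt on `cluster-b`.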
    • Use Cases as Rules
      • A condition and an action.
      • Failed services are recovered by restarting or moving them (real rules can be complicated).
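
A condition/action rule of this kind can be sketched in a few lines. This is not Hasthi's rule language; `Rule`, `evaluate`, and the status strings `"failed"` and `"host_down"` are assumptions chosen only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[str, str], bool]  # (service, status) -> does it match?
    action: Callable[[str], str]           # service -> description of action


def evaluate(rules: List[Rule], statuses: Dict[str, str]) -> List[str]:
    """Fire the first matching rule for each service; return actions taken."""
    taken = []
    for service, status in sorted(statuses.items()):
        for rule in rules:
            if rule.condition(service, status):
                taken.append(rule.action(service))
                break
    return taken


# Two illustrative rules: restart a failed service, move one whose host died.
RULES = [
    Rule("restart-failed", lambda s, st: st == "failed",
         lambda s: f"restart {s}"),
    Rule("relocate-on-host-failure", lambda s, st: st == "host_down",
         lambda s: f"relocate {s}"),
]
```

Calling `evaluate(RULES, {"a": "healthy", "b": "failed", "c": "host_down"})` yields a restart for `b` and a relocation for `c`, while the healthy service matches no rule.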
    • Rules: Detect Failed System, and Restart Workflows after failure.
    • Workflow Recovery
      • Hasthi recovers LEAD from service and host failures, and recovers failed workflows.
      • We (A) killed a service and (B) killed a host, and measured the time to detect the failure, trigger actions, have new resources join, and detect a healthy condition. It takes about 2 minutes to recover the system and to know it is healthy.
    • Evaluation: LEAD Integration
    • What Do the Results Mean?
      • Assume the MTTF of a service is f, and services are independent. Then the MTTF of the system is f/26 (by Baumann [8], assuming 26 services).
      • Using the MTTR from the above results, and assuming Hasthi does not fail, the availability of the system is (f/26) / (f/26 + MTTR).
      • That is an availability of 0.995, 0.997, and 0.999 with a per-service MTTF of 1 week, 2 weeks, and 1 month respectively, which is 46.8, 26.3, and 8.8 hours of downtime per year.
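
The availability figures above can be checked with a short calculation. The sketch assumes the usual steady-state definition, availability = MTTF / (MTTF + MTTR), a system MTTF of f/26, an MTTR of about 2 minutes (the recovery time reported in the evaluation), and a 30-day month; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def system_availability(service_mttf_h: float, n_services: int,
                        mttr_h: float) -> float:
    """Availability when any one of n independent services failing is an outage."""
    system_mttf_h = service_mttf_h / n_services
    return system_mttf_h / (system_mttf_h + mttr_h)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    mttr_h = 2 / 60  # assumed: about 2 minutes to recover
    cases = [("1 week", 7 * 24), ("2 weeks", 14 * 24), ("1 month", 30 * 24)]
    for label, mttf_h in cases:
        a = round(system_availability(mttf_h, 26, mttr_h), 3)
        print(label, a, round(downtime_hours_per_year(a), 1))
```

Under these assumptions the rounded availabilities come out as 0.995, 0.997, and 0.999. The corresponding downtimes are 43.8, 26.3, and 8.8 hours per year, so the second and third match the slide while the first differs from the 46.8 hours stated there.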
    • Demo (If we have time)
      • http://www.extreme.indiana.edu/hasthi/lead/screencasts/hasthi4.htm
    • Questions