A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids

892 views

Published on

PhD Thesis presented on November 29th 2013 at INSA-Lyon

Abstract - Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitors running experiments by performing simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, yet with important human intervention. Automating such operations is challenging for two reasons. First, the problem is online by nature because no reliable user activity prediction can be assumed, and new workloads may arrive at any time. Therefore, the considered metrics, decisions and actions have to remain simple and to yield results while the application is still executing. Second, it is non-clairvoyant due to the lack of information about applications and resources in production conditions. Computing resources are usually dynamically provisioned from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources. In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) where state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently than the others. Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.

For more information visit http://www.rafaelsilva.com

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
892
On SlideShare
0
From Embeds
0
Number of Embeds
442
Actions
Shares
0
Downloads
2
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids

  1. 1. A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids Rafael FERREIRA DA SILVA University of Lyon, CNRS, INSERM, CREATIS Villeurbanne, France Supervisors: Frédéric DESPREZ and Tristan GLATARD This work was funded by the French National Agency for Research 1under grant ANR-09-COSI-03 "VIP”
  2. 2. Outline —  Technical context and challenges —  Contributions —  —  —  —  Self-healing of workflow executions on grids Treatment of blocked activities Optimization of task granularity Fairness control among workflow executions —  Conclusions 2
  3. 3. Outline —  Technical context and challenges —  Contributions —  —  —  —  Self-healing of workflow executions on grids Treatment of blocked activities Optimization of task granularity Fairness control among workflow executions —  Conclusions 3
  4. 4. Heavy Medical Simulations Treatement planning for prostate protontherapy [L. Grevillot, D. Sarrut] Medical-Imaging Execution Platform 491 users from 52 countries CPU Time: 2 months Virtual Imaging Platform Simulated diffusion weighted images [L. Wang, Y. Zhu, I. Magnin] CPU Time: 8 years Echography simulation [O. Bernard, M. Alessandrini] CPU Time: 42 hours 4 Public Computing Infrastructure 150 computing sites world-wide Goal: Self-healing of workflow executions on grids to handle operational issues
  5. 5. 2. User launches a simulation (application workflow) 1. Input data upload 11. Download results Science-Gateway 8. Inputs download 9. Execution 10. Results upload 5 Virtual Imaging Platform (VIP) Workflow Execution 3. Workflow engine generates invocations 4. Invocations are wrapped into grid jobs High-level interface Software-as-a-Service 5. Jobs are submitted to a Pilot Engine 6. Pilot jobs are submitted to the distributed infrastructure 7. Pilot jobs fetch grid jobs
  6. 6. Workflow Execution 2. User launches a simulation (application workflow) 3. Workflow engine generates invocations 4. Invocations are wrapped into grid jobs 1. Input data upload 11. Download results Workflow Management 8. Inputs download System Applications described as workflows Parallel language Grid-aware enactor 9. Execution 5. Jobs are submitted to a Pilot Engine 6. Pilot jobs are submitted to the distributed infrastructure 10. Results upload 6 7. Pilot jobs fetch grid jobs
  7. 7. Workflow Execution 2. User launches a simulation (application workflow) 3. Workflow engine generates invocations 4. Invocations are wrapped into grid jobs 1. Input data upload 11. Download results 8. Inputs download Workload Management System 9. Execution Pilot jobs run special agents that fetch user tasks from the task queue, set up their environment and steer their execution 10. Results upload 7 5. Jobs are submitted to a Pilot Engine 6. Pilot jobs are submitted to the distributed infrastructure 7. Pilot jobs fetch grid jobs
  8. 8. Workflow Execution 2. User launches a simulation (application workflow) 3. Workflow engine generates invocations 4. Invocations are wrapped into grid jobs 1. Input data upload 11. Download results European Grid Infrastructure (EGI) +100 computing sites +25,000 job slots ~4PB of Storage 8. Inputs download 9. Execution 10. Results upload 8 5. Jobs are submitted to a Pilot Engine 6. Pilot jobs are submitted to the distributed infrastructure 7. Pilot jobs fetch grid jobs
  9. 9. Challenges —  Several workflow execution errors Average workflow completion rate is about 60% Number of launched and completed workflow in VIP from Jan to Dec 2012 —  Several dysfunctional and performance problems —  Requires manual interventions —  Problem: costly manual operations —  e.g.: rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files 9
  10. 10. Objectives —  Objective: Automated platform administration —  Autonomous detection of operational incidents —  Perform appropriate set of actions —  Assumptions: Online and non-clairvoyant —  —  —  —  10 Decisions must be fast No information about tasks (duration, data transfer time, etc.) No information about resources (availability, performance, etc.) No user activity and workloads prediction
  11. 11. Outline —  Technical context and challenges —  Contributions —  —  —  —  Self-healing of workflow executions on grids Treatment of blocked activities Optimization of task granularity Fairness control among workflow executions —  Conclusions 11
  12. 12. State of the Art —  Self-healing of workflow executions —  Most works from the literature are offline and/or clairvoyant —  Common techniques to address operational incidents —  Task resubmission —  [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010] —  Task and file replication —  [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013] —  Task grouping —  [Muthuvelu et al., 2005-2013], [Lie and Liao, 2009], [Chen et al., 2013] —  Heuristics to fairly schedule workflow tasks —  [Zhao and Sakellariou, 2006], [N’Takpe and Suter, 2009], [Casanova et al., 2010] 12
  13. 13. Fuzzy Finite State Machine —  The healing process sets the degree of FuSM states from incident Crisp states detection metrics Possible values: 0 or 1 Fuzzy states Values between 0 and 1 13
  14. 14. General MAPE-K loop event (job completion and failures) or timeout Incident 1 degree η = 0.8 level 1 level 2 Incident 2 degree η = 0.4 Incident 3 degree η = 0.1 level 1 level 1 level 3 level 2 level 3 level 2 level 3 Monitoring Analysis Set of Actions 6e+04 = 0e+00 x2 Frequency Monitoring data 0.0 Execution 0.2 0.4 0.6 0.8 ∑ n ηj j =1 1.0 ηu Knowledge Roulette wheel selection € Planning Rule Confidence (ρ) ρxη Selected 2è 1 0.8 0.32 Selected Incident 2 3 è 1 0.2 0.02 Incident 1 1 è 1 1.0 0.80 Roulette wheel selection based on association rules 14 ηi Association rules for incident 1 R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on

×