On-line, non-clairvoyant optimization of workflow activity granularity on grids

Presentation held at Euro-Par 2013, Aachen, Germany

Abstract. Controlling the granularity of workflow activities executed on widely distributed computing platforms such as grids is required to reduce the impact of task queuing and data transfer time. Most existing granularity control approaches assume extensive knowledge about the applications and resources (e.g. task duration on each resource), and that both the workload and available resources do not change over time. We propose a granularity control algorithm for platforms where such clairvoyant and offline conditions are not realistic. Our method groups tasks when the fineness degree of the application, which takes into account the ratio of shared data and the queuing/round-trip time ratio, becomes higher than a threshold determined from execution traces. The algorithm also de-groups task groups when new resources arrive. The application's behavior is constantly monitored so that the characteristics useful for the optimization are progressively discovered. Experimental results, obtained with 3 workflow activities deployed on the European Grid Infrastructure, show that (i) the grouping process yields speed-ups of about 2.5 when the amount of available resources is constant and that (ii) the use of de-grouping yields speed-ups of 2 when resources progressively appear.

More information: www.rafaelsilva.com

Transcript

  • 1. On-line, Non-Clairvoyant Optimization of Workflow Activity Granularity on Grids
      • Rafael FERREIRA DA SILVA, Tristan GLATARD (University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France)
      • Frédéric DESPREZ (INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France)
      • Euro-Par 2013, August 26-30, 2013
      • Contact: Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 2. Outline
      • Context
      • The Virtual Imaging Platform
      • Problem definition
      • Task granularity
      • Self-healing of workflow executions on grids
      • Task granularity control process
      • Experiments and results
      • Conclusion
  • 4. Context
      • Virtual Imaging Platform (VIP)
          • Medical imaging science-gateway
          • Grid of ~180 sites (EGI – http://www.egi.eu)
      • Significant usage
          • 452 registered users from 50 countries
          • Consumed 472 CPU years from August 2012 to July 2013
      • [Figure: VIP consumption since August 2012, source http://dirac.france-grilles.fr]
  • 5. Workflow Execution
      • 1. Input data upload
      • 2. User launches a simulation
      • 3. MOTEUR generates invocations
      • 4. GASW generates grid jobs
      • 5. Jobs are submitted to DIRAC
      • 6. Pilot jobs are submitted to EGI
      • 7. Pilot jobs fetch grid jobs
      • 8. Inputs download
      • 9. Execution
      • 10. Results upload
      • 11. Download results
  • 6. Task Granularity
      • Low performance of lightweight (a.k.a. fine-grained) tasks:
          • High queuing times
          • Communication overhead
      • [Figure: timeline of lightweight tasks t1 to t5 on resources R1 to R3; lightweight task executions are delayed]
      • Grouping tasks into coarse-grained tasks reduces the cost of data transfers when the grouped tasks share input data, and saves queuing time
  • 7. Workflow Self-Healing
      • Problem: costly manual operations
          • Rescheduling tasks, restarting services or replicating data files
          • In this work: task granularity in distributed workflows
      • Objective: automated platform administration
          • Autonomous detection of fine-grained tasks
          • Perform an appropriate set of actions
      • Assumptions: online and non-clairvoyant
          • Only partial information available
          • Decisions must be fast
          • Production conditions: no prediction of user activity or workloads
  • 8. General MAPE-K loop
      • Loop phases: Monitoring, Analysis, Planning and Execution, organized around shared Knowledge (monitoring data)
      • The loop is triggered by an event (job completion and failures) or a timeout
      • Example incident degrees, each quantified into level 1, 2 or 3: incident 1, η = 0.8; incident 2, η = 0.4; incident 3, η = 0.1
      • Roulette wheel selection: incident i is selected with probability $\eta_i / \sum_{j=1}^{n} \eta_j$ (incident 1 is selected in the example)
      • Association rules for incident 1:
            Rule     Confidence (ρ)   ρ × η
            2 → 1    0.8              0.32
            3 → 1    0.2              0.02
            1 → 1    1.0              0.80
      • Roulette wheel selection based on association rules (incident 2 is selected in the example), which determines the set of actions (marked x2 on the slide)
      • Reference: R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), in press, 2013.
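
The two roulette wheel selections on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `roulette_wheel_select` and the dictionary layout are hypothetical, and the reading of the table rows as rules "k → 1" weighted by ρ × η of incident k is inferred from the numbers on the slide, not taken from the authors' implementation.

```python
import random


def roulette_wheel_select(weights, rng=random):
    """Return one key of `weights`, chosen with probability weight / sum(weights)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.uniform(0.0, total)
    cumulative = 0.0
    chosen = None
    for name, weight in weights.items():
        cumulative += weight
        chosen = name
        if pick < cumulative:
            break
    return chosen


# Incident degrees (eta) from the slide example.
incident_degrees = {1: 0.8, 2: 0.4, 3: 0.1}

# Confidence (rho) of each association rule "k -> 1" from the slide table
# (assumed reading of the flattened table).
rule_confidence = {2: 0.8, 3: 0.2, 1: 1.0}

# Rule weights rho x eta, matching the last column of the table.
rule_weights = {k: rho * incident_degrees[k] for k, rho in rule_confidence.items()}

# First wheel: pick an incident; second wheel: pick a rule for incident 1.
selected_incident = roulette_wheel_select(incident_degrees)
selected_rule_source = roulette_wheel_select(rule_weights)
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when replaying execution traces.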
  • 9. Incident Levels and Actions
      • Incident degrees are quantified in discrete incident levels
      • Thresholds are determined from visual mode clustering or K-means
      • Thresholds cluster platform configurations into groups
      • [Figure: incident levels, annotated "No actions are triggered" and "Triggers a set of actions"]
  • 11. Fineness control: degree
      • Task execution phases: queued time, shared input data, other input data, application execution
      • Incident degree:
            $$\eta_f = \max_{i \in [1,m]} \{ f_i = d_i \cdot r_i \}$$
            $$d_i = \frac{\tilde{t}_{shared}}{\tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$$
            $$r_i = \frac{\max_{j \in [1,n_i]} q_j}{\max_{j \in [1,n_i]} q_j + \tilde{t}_{shared} + n_i (\tilde{t} - \tilde{t}_{shared})}$$
      • Notation: $\tilde{t}_{shared}$ and $\tilde{t}$ are median task phase durations; $q_j$ is the queued time; i = waiting task; n = number of waiting tasks
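
As a rough illustration of the three formulas on this slide, the Python sketch below computes η_f from a list of queuing times. The function name, the argument layout (one list of queuing times q_j per index i, with n_i taken as the length of that list) and the example values are assumptions made for this sketch; they are not given in the presentation.

```python
def fineness_degree(queue_times_per_item, t_median, t_shared_median):
    """Compute eta_f = max_i (d_i * r_i) from the slide formulas.

    queue_times_per_item: one list of queuing times q_j per index i
                          (n_i is assumed to be the length of that list).
    t_median:             median task duration (t~ on the slide).
    t_shared_median:      median transfer time of shared input data.
    """
    eta_f = 0.0
    for queue_times in queue_times_per_item:
        n_i = len(queue_times)
        # Common term: t~_shared + n_i * (t~ - t~_shared)
        work = t_shared_median + n_i * (t_median - t_shared_median)
        d_i = t_shared_median / work   # share of time spent on shared data
        q_max = max(queue_times)
        r_i = q_max / (q_max + work)   # queuing / round-trip time ratio
        eta_f = max(eta_f, d_i * r_i)
    return eta_f


# Hypothetical values, in seconds: one waiting task queued for 600 s,
# a 715 s median task duration and a 200 s shared-data transfer.
example_eta_f = fineness_degree([[600.0]], t_median=715.0, t_shared_median=200.0)
```

Both factors lie between 0 and 1, so η_f grows when shared-data transfer dominates the task duration and when queuing dominates the round-trip time, which is the situation where grouping pays off.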
  • 12. Fineness control: task estimation
      • Estimation of task durations
          • Job phases: setup → inputs download → execution → outputs upload
          • Assumption: bag of tasks (all jobs have equal durations)
      • Median-based estimation:
            Phase                          setup   inputs download   execution   outputs upload   total
            Median duration of job phases  50s     250s              400s        15s              $\tilde{t}$ = 715s
            Real job duration              42s     300s              20s         ?
            Estimated job duration         42s     300s              400s*       15s              $\tilde{t}_i$ = 757s
      • Setup and inputs download are completed; execution is the current phase
      • *: max(400s, 20s) = 400s
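
The median-based estimation can be sketched as follows, reproducing the numbers of the slide example. The phase keys and function names are hypothetical names chosen for this sketch; the rule itself (observed durations for completed phases, max(median, elapsed) for the current phase, medians for the remaining phases) is read off the slide.

```python
from statistics import median

PHASES = ("setup", "inputs_download", "execution", "outputs_upload")


def median_phase_durations(completed_jobs):
    """Median duration of each phase over the jobs completed so far."""
    return {p: median(job[p] for job in completed_jobs) for p in PHASES}


def estimate_job_duration(median_phases, observed, current_phase):
    """Estimate the total duration of a job that is still running."""
    total = 0.0
    reached_current = False
    for phase in PHASES:
        if phase == current_phase:
            reached_current = True
            # Current phase: at least the median, or more if already exceeded.
            total += max(median_phases[phase], observed[phase])
        elif not reached_current:
            total += observed[phase]        # completed phase: real duration
        else:
            total += median_phases[phase]   # future phase: median duration
    return total


# Values from the slide, in seconds.
medians = {"setup": 50, "inputs_download": 250, "execution": 400, "outputs_upload": 15}
running_job = {"setup": 42, "inputs_download": 300, "execution": 20}

estimated = estimate_job_duration(medians, running_job, "execution")
print(estimated)  # 757.0, as on the slide
```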
  • 13. Fineness control: levels and actions
      • Levels: identified from the platform logs and separated by the threshold $\tau_f$
          • Level 1: no actions
          • Level 2: action is task grouping
      • Actions
          • Task grouping
          • Tasks are grouped pairwise until $\eta_f \le \tau_f$ or the number of waiting groups Q is smaller than or equal to the number of running groups R
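
A minimal sketch of the pairwise grouping loop is given below. The slide only states the stopping conditions (η_f ≤ τ_f, or Q ≤ R); which two groups are merged at each step is not specified, so merging the two smallest waiting groups is an assumption of this sketch, as are the function name and the `degree_of` callback used to recompute η_f.

```python
def group_pairwise(waiting, running_count, tau_f, degree_of):
    """Merge waiting groups two at a time until eta_f <= tau_f or Q <= R.

    waiting:       list of groups, each a list of task identifiers (Q = len).
    running_count: number of running groups (R).
    tau_f:         fineness threshold.
    degree_of:     hypothetical callback returning eta_f for a list of groups.
    """
    groups = [list(g) for g in waiting]
    while (
        len(groups) > running_count      # stop when Q <= R
        and len(groups) >= 2             # need two groups to merge
        and degree_of(groups) > tau_f    # stop when eta_f <= tau_f
    ):
        groups.sort(key=len)             # assumption: merge the two smallest
        merged = groups[0] + groups[1]
        groups = groups[2:] + [merged]
    return groups


# Five single-task groups waiting, three groups running, tasks always "too fine".
result = group_pairwise([[1], [2], [3], [4], [5]], running_count=3,
                        tau_f=0.5, degree_of=lambda groups: 1.0)
```

In this example the loop stops once three groups remain, because the number of waiting groups no longer exceeds the number of running groups.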
  • 14. Coarseness control
      • Non-stationary load
      • Loss of parallelism
      • Task de-grouping
      • Incident degree: $\eta_c = \frac{R}{Q + R}$
      • Levels: threshold $\tau_c = 0.5$, so tasks are de-grouped when R > Q
      • [Figure: timeline of tasks t1 to t5 on resources R1 to R3, grouped as t1, t2+t3 and t4+t5, illustrating the loss of parallelism]
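
The coarseness test reduces to a one-line ratio, sketched here in Python. The function names are hypothetical, and returning 0 when there are no groups at all is a convention chosen for this sketch rather than something stated on the slide.

```python
TAU_C = 0.5  # threshold from the slide


def coarseness_degree(running, queued):
    """eta_c = R / (Q + R), with R running groups and Q waiting groups."""
    total = running + queued
    if total == 0:
        return 0.0  # convention for this sketch: nothing to de-group
    return running / total


def should_degroup(running, queued, tau_c=TAU_C):
    """True when eta_c exceeds tau_c; with tau_c = 0.5 this means R > Q."""
    return coarseness_degree(running, queued) > tau_c


print(should_degroup(running=3, queued=1))  # True: more running than waiting
print(should_degroup(running=1, queued=3))  # False
```

With τ_c = 0.5 the inequality η_c > τ_c is equivalent to R > Q, which is why the slide states the de-grouping condition in both forms.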
  • 15. Workload for Case Studies
      • Based on the workload of VIP, January 2011 to April 2012
      • Case studies on: pilot jobs, user accounting, task analysis, bag of tasks, workflows
      • 112 users, 2,941 workflow executions, 339,545 pilot jobs
      • 680,988 tasks: 338,989 completed, 138,480 error, 105,488 aborted, 15,576 aborted replicas, 48,293 stalled, 34,162 queued
      • Reference: R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
  • 17. Experiment Conditions
      • Experiment 1: evaluate the fineness control process under stationary load
      • Experiment 2: evaluate the de-grouping control process under non-stationary load
      • Workflow characteristics
  • 18. Results: stationary load
      • Fineness yields significant makespan reduction for all repetitions
  • 19. Results: stationary load (2)
      • Task grouping speeds up SimuBloch and FIELD-II by up to a factor of 2.6, and PET-SORTEO/emission by up to a factor of 2.5
      • Not all SimuBloch tasks can be grouped into a single group, because 2 tasks must be completed for the task estimation process
  • 20. Results: non-stationary load
      • [Figure panels: "Resources appear progressively" and "Resources appear suddenly"]
      • Executions are sped up by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness
      • Fineness is penalized by its lack of adaptation: slowdown of 20%
  • 21. Results: non-stationary load (2)
      • The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates that they are strongly correlated
  • 23. Concluding remarks
      • Context
          • Autonomous handling of unfairness among workflow executions
          • No strong assumptions on resource characteristics and workload
      • Summary of the proposed method
          • Implements a generic MAPE-K loop
          • Determines task fineness based on queue waiting time and estimated data transfer time of shared input data
          • Tasks are grouped pairwise as long as Q > R and tasks are too fine
          • Tasks are ungrouped when the number of available resources increases
      • Optimizing task granularity
          • Properly detects and handles lightweight tasks
          • Stationary load: fineness control significantly reduces the makespan of all applications
          • Non-stationary load: the de-grouping algorithm compensates for the lack of adaptation of task grouping
  • 24. Thank you for your attention. Questions?
      • Acknowledgments:
          • VIP users and project members
          • French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063)
          • EC FP7 Programme (312579 ER-flow)
          • European Grid Initiative (EGI)
          • France-Grilles