Your SlideShare is downloading. ×
0
WorkflowSim: A Toolkit for Simulating  Scientific Workflows in Distributed             Environments         Weiwei Chen, E...
Outline§  Introduction  –  Scientific workflows?  –  Distributed environments?§  Challenge  –  Large scale, system overh...
Scientific Applications§  Scientists often need to  –  Integrate diverse components and data  –  Automate data processing...
Scientific Workflows§  DAG model (Directed Acyclic Graph)  –  Node: computational activities  –  Directed edge: data depe...
Workflow Management System§  Common Features of WMS  –  Maps abstract workflows to executable workflows  –  Handles data ...
Workflow Simulation§  Benefits  –  Save efforts in system setup, execution  –  Repeat experimental results  –  Control sy...
Comparison                          CloudSim           WorkflowSim                         Task, Bag of  Execution Model  ...
System Architecture§  Submit Host  –  Workflow Mapper  –  Clustering Engine  –  Workflow Engine  –  Local Scheduler§  Ex...
Workflow Overhead–  Workflow Engine       –  Postscript Delay   Delay                 –  Clustering Delay–  Queue Delay   ...
Example: Clustering Delay                         n is the number of tasks per                         level. k is number ...
Validation           !"#$%&#$!!"#$%&&!!"#$%&   Ideal Case: Accuracy=1.0!!!"#$!% =             !"#$!!"#$%&&!!"#$%&      k: ...
Application: Overhead Robustness§  Overhead robustness: the influence of    overheads on the workflow runtime for DAG    ...
Application: Overhead Robustness§  Increase or Reduce Overhead by a Factor    (Weight)  –  Under estimation and over esti...
Application: Overhead Robustness                           MCT<MinMin MCT<MinMin         MCT>MinMinMCT<MinMin    Accurate ...
Workflow Failure§  Failures have significant influence on the    performance§  Classifying Transient Failures   –  Task ...
System Architecture§  Submit Host  –  Job Retry  –  Reclustering§  Execution Site  –  Failure Generator  –  Failure Moni...
Application: Fault Tolerant Clustering§  Task clustering can reduce execution overhead§  A job composed of multiple task...
Application: Fault Tolerant Clustering–  Selective Reclustering (SR) retries the failed tasks in a job–  Dynamic Recluster...
Performance                       §  Reclustering (DR/DC/SR) reduces the influence of                           failures ...
Conclusion§  WorkflowSim assists researchers to evaluate    their workflow optimization techniques with    better accurac...
Acknowledgements§    Pegasus Team: http://pegasus.isi.edu§    FutureGrid & XSEDE§    Funded by NSF grants IIS-0905032.§...
Overhead Distribution                   Montage          22
Overhead Distribution                   CyberShake          23
Broadband            24
Task Failure Model and Job Failure Model                n                                       4d      k* =              ...
Related Papers§    Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012.§    Improving Scientific...
Upcoming SlideShare
Loading in...5
×

Workflowsim escience12

325

Published on

workflow simulator

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
325
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
29
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Workflowsim escience12"

  1. 1. WorkflowSim: A Toolkit for Simulating Scientific Workflows in Distributed Environments Weiwei Chen, Ewa Deelman
  2. 2. Outline§  Introduction –  Scientific workflows? –  Distributed environments?§  Challenge –  Large scale, system overhead§  Solution –  Workflow Overhead and Failure Model§  Validation and Application –  Overhead Robustness –  Fault Tolerant Clustering 2
  3. 3. Scientific Applications§  Scientists often need to –  Integrate diverse components and data –  Automate data processing steps –  Reproduce/analyze/share previous results –  Track the provenance of data products –  Execute processing steps efficiently –  Reliably execute applications Scientific Workflows provide solutions to these problems 3
  4. 4. Scientific Workflows§  DAG model (Directed Acyclic Graph) –  Node: computational activities –  Directed edge: data dependencies –  Task: a process that users would like to execute –  Job: a single unit for execution with one of more tasks –  Task clustering: the process of grouping tasks to jobs 4
  5. 5. Workflow Management System§  Common Features of WMS –  Maps abstract workflows to executable workflows –  Handles data dependencies –  Replica selection, transfers, registration, cleanup –  Task clustering, workflow partitioning, scheduling –  Reliability and fault tolerance –  Monitoring and troubleshooting§  Existing WMS –  Pegasus, Askalon, Taverna, Kepler, Triana 5
  6. 6. Workflow Simulation§  Benefits –  Save efforts in system setup, execution –  Repeat experimental results –  Control system environments (failures)§  Trace based Workflow Simulation –  Import trace from a completed execution –  Vary workflow structures and system environments§  Challenges (CloudSim, GridSim, etc.) –  System overhead and failures –  Multiple levels of computational activities(task/job) –  A hierarchy of management components 6
  7. 7. Comparison CloudSim WorkflowSim Task, Bag of Execution Model Task, Job, DAG TasksFailure and Monitoring No Yes Data Transfer Delay Data Transfer Overhead Model Workflow Engine Delay Delay Clustering Delay … Site selection Single Multiple Scheduling, Job retry Optimization Scheduling Clustering, Techniques Partitioning … WorkflowSim is an extension of CloudSim, but it is workflow aware 7
  8. 8. System Architecture§  Submit Host –  Workflow Mapper –  Clustering Engine –  Workflow Engine –  Local Scheduler§  Execution Site –  Remote Scheduler –  Worker Nodes –  Failure Generator –  Failure Monitor 8
  9. 9. Workflow Overhead–  Workflow Engine –  Postscript Delay Delay –  Clustering Delay–  Queue Delay –  Data Transfer Delay 9
  10. 10. Example: Clustering Delay n is the number of tasks per level. k is number of jobs per level. !"#$%&()*!!"#$%|!!! ! ! ! = = !"#$%&()*!!"#$%|!!! ! ! ! mProjectPP, mDiffFit, and mBackground are the major jobs of MontageOverhead is not a constant variable.It has diverse distribution and patterns 10
  11. 11. Validation !"#$%&#$!!"#$%&&!!"#$%& Ideal Case: Accuracy=1.0!!!"#$!% = !"#$!!"#$%&&!!"#$%& k: Maximum jobs per level Overheads have the biggest impact 11
  12. 12. Application: Overhead Robustness§  Overhead robustness: the influence of overheads on the workflow runtime for DAG scheduling heuristics.§  Inaccurate estimation (under- or over-estimated) of workflow overheads influences the overall runtime of workflows.§  Research Merits: –  Sensitivity of heuristics –  Overhead friendly heuristics 12
  13. 13. Application: Overhead Robustness§  Increase or Reduce Overhead by a Factor (Weight) –  Under estimation and over estimation§  Heuristics Evaluated –  FCFS: First Come First Serve –  MCT: Minimum Completion Time –  MinMin: The job with the minimum completion time is selected and assigned to the fastest resource. –  MaxMin: The job with the maximum completion time and assigns it to its best available resource. 13
  14. 14. Application: Overhead Robustness MCT<MinMin MCT<MinMin MCT>MinMinMCT<MinMin Accurate estimation of Clustering Delay is more important 14
  15. 15. Workflow Failure§  Failures have significant influence on the performance§  Classifying Transient Failures –  Task Failure: A task fails, other tasks within the same job may not fail –  Job Failure: A job fails, all of its tasks fail 15
  16. 16. System Architecture§  Submit Host –  Job Retry –  Reclustering§  Execution Site –  Failure Generator –  Failure Monitor 16
  17. 17. Application: Fault Tolerant Clustering§  Task clustering can reduce execution overhead§  A job composed of multiple tasks may have a greater risk of suffering from failures§  Reclustering and Job Retry are proposed –  No Optimization (NOOP) retries the failed jobs. –  Dynamic Clustering (DC) decreases the clusters.size if the measured job failure rate is high. 17
  18. 18. Application: Fault Tolerant Clustering–  Selective Reclustering (SR) retries the failed tasks in a job–  Dynamic Reclustering (DR) retires the failed tasks in a job and also decreases the clusters.size if the measured job failure rate is high. 18
  19. 19. Performance §  Reclustering (DR/DC/SR) reduces the influence of failures significantly compared to NOOP §  DR outperforms other techniques.The lower the better 19
  20. 20. Conclusion§  WorkflowSim assists researchers to evaluate their workflow optimization techniques with better accuracy and wider support. –  Modeling Overhead and Failures –  Distributed and Hierarchical components –  Workflow Techniques§  It is necessary to consider both data dependencies, workflow failures and system overhead. 20
  21. 21. Acknowledgements§  Pegasus Team: http://pegasus.isi.edu§  FutureGrid & XSEDE§  Funded by NSF grants IIS-0905032.§  More info: wchen@isi.edu§  Available on request§  Q&A 21
  22. 22. Overhead Distribution Montage 22
  23. 23. Overhead Distribution CyberShake 23
  24. 24. Broadband 24
  25. 25. Task Failure Model and Job Failure Model n 4d k* = −d + d 2 − r ln(1− α ) k* = , if n >> r 2t * (kt + d) ttotal = * n(k *t + d) 1− β t = total * rk(1− α )k 25 25
  26. 26. Related Papers§  Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012.§  Improving Scientific Workflow Performance using Policy Based Data Placement, IEEE Policy 2012.§  Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012§  Workflow Overhead Analysis and Optimizations, WORKS 2011.§  Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints, PPAM 2011§  Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems, Scientific Programming Journal 2005 26
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×