SlideShare a Scribd company logo
WorkflowSim: A Toolkit for Simulating
  Scientific Workflows in Distributed
             Environments
         Weiwei Chen, Ewa Deelman
Outline
§  Introduction
  –  Scientific workflows?
  –  Distributed environments?
§  Challenge
  –  Large scale, system overhead
§  Solution
  –  Workflow Overhead and Failure Model
§  Validation and Application
  –  Overhead Robustness
  –  Fault Tolerant Clustering

                           2
Scientific Applications
§  Scientists often need to
  –  Integrate diverse components and data
  –  Automate data processing steps
  –  Reproduce/analyze/share previous results
  –  Track the provenance of data products
  –  Execute processing steps efficiently
  –  Reliably execute applications

       Scientific Workflows provide
       solutions to these problems


                         3
Scientific Workflows
§  DAG model (Directed Acyclic Graph)
  –  Node: computational activities
  –  Directed edge: data dependencies
  –  Task: a process that users would like to execute
  –  Job: a single unit for execution with one of more tasks
  –  Task clustering: the process of grouping tasks to jobs




                             4
Workflow Management System
§  Common Features of WMS
  –  Maps abstract workflows to executable workflows
  –  Handles data dependencies
  –  Replica selection, transfers, registration, cleanup
  –  Task clustering, workflow partitioning, scheduling
  –  Reliability and fault tolerance
  –  Monitoring and troubleshooting
§  Existing WMS
  –  Pegasus, Askalon, Taverna, Kepler, Triana



                           5
Workflow Simulation
§  Benefits
  –  Save efforts in system setup, execution
  –  Repeat experimental results
  –  Control system environments (failures)
§  Trace based Workflow Simulation
  –  Import trace from a completed execution
  –  Vary workflow structures and system environments
§  Challenges (CloudSim, GridSim, etc.)
  –  System overhead and failures
  –  Multiple levels of computational activities(task/job)
  –  A hierarchy of management components
                           6
Comparison
                          CloudSim           WorkflowSim
                         Task, Bag of
  Execution Model                           Task, Job, DAG
                            Tasks
Failure and Monitoring        No                 Yes

                                          Data Transfer Delay
                         Data Transfer
  Overhead Model                         Workflow Engine Delay
                            Delay
                                          Clustering Delay …

    Site selection          Single             Multiple
                                         Scheduling, Job retry
    Optimization
                          Scheduling         Clustering,
    Techniques
                                            Partitioning …
     WorkflowSim is an extension of
     CloudSim, but it is workflow aware
                             7
System Architecture
§  Submit Host
  –  Workflow Mapper
  –  Clustering Engine
  –  Workflow Engine
  –  Local Scheduler
§  Execution Site
  –  Remote Scheduler
  –  Worker Nodes
  –  Failure Generator
  –  Failure Monitor

                         8
Workflow Overhead
–  Workflow Engine       –  Postscript Delay
   Delay                 –  Clustering Delay
–  Queue Delay           –  Data Transfer Delay




                     9
Example: Clustering Delay
                         n is the number of tasks per
                         level. k is number of jobs per
                         level.
                          !"#$%&'()*!!"#$%|!!! ! ! !
                                              =   =
                          !"#$%&'()*!!"#$%|!!! ! ! !



                         mProjectPP, mDiffFit, and
                         mBackground are the major
                         jobs of Montage

Overhead is not a constant variable.
It has diverse distribution and patterns

                 10
Validation
           !"#$%&'#$!!"#$%&&!!"#$%&'   Ideal Case: Accuracy=1.0
!!!"#$!% =
             !"#$!!"#$%&&!!"#$%&'      k: Maximum jobs per level




                                        Overheads have
                                        the biggest impact




                               11
Application: Overhead Robustness
§  Overhead robustness: the influence of
    overheads on the workflow runtime for DAG
    scheduling heuristics.
§  Inaccurate estimation (under- or over-estimated)
    of workflow overheads influences the overall
    runtime of workflows.
§  Research Merits:
     –  Sensitivity of heuristics
     –  Overhead friendly heuristics



                         12
Application: Overhead Robustness
§  Increase or Reduce Overhead by a Factor
    (Weight)
  –  Under estimation and over estimation
§  Heuristics Evaluated
  –  FCFS: First Come First Serve
  –  MCT: Minimum Completion Time
  –  MinMin: The job with the minimum completion time is
     selected and assigned to the fastest resource.
  –  MaxMin: The job with the maximum completion time
     and assigns it to its best available resource.



                           13
Application: Overhead Robustness
                           MCT<MinMin MCT<MinMin
         MCT>MinMin

MCT<MinMin




    Accurate estimation of Clustering
    Delay is more important

                      14
Workflow Failure
§  Failures have significant influence on the
    performance
§  Classifying Transient Failures
   –  Task Failure: A task fails, other tasks within the
      same job may not fail
   –  Job Failure: A job fails, all of its tasks fail




                            15
System Architecture
§  Submit Host
  –  Job Retry
  –  Reclustering
§  Execution Site
  –  Failure Generator
  –  Failure Monitor




                         16
Application: Fault Tolerant Clustering
§  Task clustering can reduce execution overhead
§  A job composed of multiple tasks may have a
    greater risk of suffering from failures
§  Reclustering and Job Retry are proposed
  –  No Optimization (NOOP) retries the failed jobs.
  –  Dynamic Clustering (DC) decreases the clusters.size if
     the measured job failure rate is high.




                           17
Application: Fault Tolerant Clustering
–  Selective Reclustering (SR) retries the failed tasks in a job
–  Dynamic Reclustering (DR) retires the failed tasks in a job and also
   decreases the clusters.size if the measured job failure rate is high.




                               18
Performance
                       §  Reclustering (DR/DC/SR) reduces the influence of
                           failures significantly compared to NOOP
                       §  DR outperforms other techniques.
The lower the better




                                                 19
Conclusion
§  WorkflowSim assists researchers to evaluate
    their workflow optimization techniques with
    better accuracy and wider support.
  –  Modeling Overhead and Failures
  –  Distributed and Hierarchical components
  –  Workflow Techniques
§  It is necessary to consider both data
    dependencies, workflow failures and system
    overhead.



                           20
Acknowledgements
§    Pegasus Team: http://pegasus.isi.edu
§    FutureGrid & XSEDE
§    Funded by NSF grants IIS-0905032.
§    More info: wchen@isi.edu
§    Available on request
§    Q&A




                               21
Overhead Distribution




                   Montage




          22
Overhead Distribution




                   CyberShake




          23
Broadband




            24
Task Failure Model and Job Failure Model




                n                                       4d
      k* =                           −d + d 2 −
                r                                    ln(1− α )
                              k* =                               ,   if   n >> r
                                                2t
     *       (kt + d)
    ttotal
        =                    *   n(k *t + d)
              1− β           t =
                             total          *
                                 rk(1− α )k




                        25                               25
Related Papers


§    Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012.
§    Improving Scientific Workflow Performance using Policy Based Data
      Placement, IEEE Policy 2012.
§    Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012
§    Workflow Overhead Analysis and Optimizations, WORKS 2011.
§    Partitioning and Scheduling Workflows across Multiple Sites with Storage
      Constraints, PPAM 2011
§    Pegasus: a Framework for Mapping Complex Scientific Workflows onto
      Distributed Systems, Scientific Programming Journal 2005




                                          26

More Related Content

What's hot

Clustering: Méthode hiérarchique
Clustering: Méthode hiérarchiqueClustering: Méthode hiérarchique
Clustering: Méthode hiérarchique
Yassine Mhadhbi
 
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essaiSSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
Microsoft Technet France
 
Tears in Heaven - Eric Clapton
Tears in Heaven - Eric ClaptonTears in Heaven - Eric Clapton
Tears in Heaven - Eric Clapton
david bonnin
 
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
Hani sami joga
 
Detection de fraude
Detection de fraudeDetection de fraude
Detection de fraude
Dimedrik Vanil FEUDJIEU
 
Tp Sql Server Integration Services 2008
Tp  Sql Server Integration Services  2008Tp  Sql Server Integration Services  2008
Tp Sql Server Integration Services 2008
Abdelouahed Abdou
 

What's hot (7)

Clustering: Méthode hiérarchique
Clustering: Méthode hiérarchiqueClustering: Méthode hiérarchique
Clustering: Méthode hiérarchique
 
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essaiSSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
SSAS 2012 : Multidimensionnel et tabulaire au banc d'essai
 
Tears in Heaven - Eric Clapton
Tears in Heaven - Eric ClaptonTears in Heaven - Eric Clapton
Tears in Heaven - Eric Clapton
 
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
excel pour mur en beton arme (dimensionnement) - télécharger l'excel : http:/...
 
Detection de fraude
Detection de fraudeDetection de fraude
Detection de fraude
 
Tp Sql Server Integration Services 2008
Tp  Sql Server Integration Services  2008Tp  Sql Server Integration Services  2008
Tp Sql Server Integration Services 2008
 
Np complete
Np completeNp complete
Np complete
 

Similar to Workflowsim escience12

High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
aragozin
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
Olga Lavrentieva
 
Velocity 2018 preetha appan final
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan final
preethaappan
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
DataWorks Summit
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
Steve Min
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
camunda services GmbH
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
David Ribeiro Alves
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpande
IndicThreads
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Hortonworks
 
NoSQL and ACID
NoSQL and ACIDNoSQL and ACID
NoSQL and ACID
FoundationDB
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
Revolution Analytics
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
DataWorks Summit
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
Facultad de Informática UCM
 
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Nane Kratzke
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
DoKC
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
Lu Wei
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
Cloudera, Inc.
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
Masashi Narumoto
 
Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0 Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0
Sumanth Chinthagunta
 

Similar to Workflowsim escience12 (20)

High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
Velocity 2018 preetha appan final
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan final
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpande
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
NoSQL and ACID
NoSQL and ACIDNoSQL and ACID
NoSQL and ACID
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
 
Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0 Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0
 

Recently uploaded

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Workflowsim escience12

  • 1. WorkflowSim: A Toolkit for Simulating Scientific Workflows in Distributed Environments Weiwei Chen, Ewa Deelman
  • 2. Outline §  Introduction –  Scientific workflows? –  Distributed environments? §  Challenge –  Large scale, system overhead §  Solution –  Workflow Overhead and Failure Model §  Validation and Application –  Overhead Robustness –  Fault Tolerant Clustering 2
  • 3. Scientific Applications §  Scientists often need to –  Integrate diverse components and data –  Automate data processing steps –  Reproduce/analyze/share previous results –  Track the provenance of data products –  Execute processing steps efficiently –  Reliably execute applications Scientific Workflows provide solutions to these problems 3
  • 4. Scientific Workflows §  DAG model (Directed Acyclic Graph) –  Node: computational activities –  Directed edge: data dependencies –  Task: a process that users would like to execute –  Job: a single unit for execution with one of more tasks –  Task clustering: the process of grouping tasks to jobs 4
  • 5. Workflow Management System §  Common Features of WMS –  Maps abstract workflows to executable workflows –  Handles data dependencies –  Replica selection, transfers, registration, cleanup –  Task clustering, workflow partitioning, scheduling –  Reliability and fault tolerance –  Monitoring and troubleshooting §  Existing WMS –  Pegasus, Askalon, Taverna, Kepler, Triana 5
  • 6. Workflow Simulation §  Benefits –  Save efforts in system setup, execution –  Repeat experimental results –  Control system environments (failures) §  Trace based Workflow Simulation –  Import trace from a completed execution –  Vary workflow structures and system environments §  Challenges (CloudSim, GridSim, etc.) –  System overhead and failures –  Multiple levels of computational activities(task/job) –  A hierarchy of management components 6
  • 7. Comparison CloudSim WorkflowSim Task, Bag of Execution Model Task, Job, DAG Tasks Failure and Monitoring No Yes Data Transfer Delay Data Transfer Overhead Model Workflow Engine Delay Delay Clustering Delay … Site selection Single Multiple Scheduling, Job retry Optimization Scheduling Clustering, Techniques Partitioning … WorkflowSim is an extension of CloudSim, but it is workflow aware 7
  • 8. System Architecture §  Submit Host –  Workflow Mapper –  Clustering Engine –  Workflow Engine –  Local Scheduler §  Execution Site –  Remote Scheduler –  Worker Nodes –  Failure Generator –  Failure Monitor 8
  • 9. Workflow Overhead –  Workflow Engine –  Postscript Delay Delay –  Clustering Delay –  Queue Delay –  Data Transfer Delay 9
  • 10. Example: Clustering Delay n is the number of tasks per level. k is number of jobs per level. !"#$%&'()*!!"#$%|!!! ! ! ! = = !"#$%&'()*!!"#$%|!!! ! ! ! mProjectPP, mDiffFit, and mBackground are the major jobs of Montage Overhead is not a constant variable. It has diverse distribution and patterns 10
  • 11. Validation !"#$%&'#$!!"#$%&&!!"#$%&' Ideal Case: Accuracy=1.0 !!!"#$!% = !"#$!!"#$%&&!!"#$%&' k: Maximum jobs per level Overheads have the biggest impact 11
  • 12. Application: Overhead Robustness §  Overhead robustness: the influence of overheads on the workflow runtime for DAG scheduling heuristics. §  Inaccurate estimation (under- or over-estimated) of workflow overheads influences the overall runtime of workflows. §  Research Merits: –  Sensitivity of heuristics –  Overhead friendly heuristics 12
  • 13. Application: Overhead Robustness §  Increase or Reduce Overhead by a Factor (Weight) –  Under estimation and over estimation §  Heuristics Evaluated –  FCFS: First Come First Serve –  MCT: Minimum Completion Time –  MinMin: The job with the minimum completion time is selected and assigned to the fastest resource. –  MaxMin: The job with the maximum completion time and assigns it to its best available resource. 13
  • 14. Application: Overhead Robustness MCT<MinMin MCT<MinMin MCT>MinMin MCT<MinMin Accurate estimation of Clustering Delay is more important 14
  • 15. Workflow Failure §  Failures have significant influence on the performance §  Classifying Transient Failures –  Task Failure: A task fails, other tasks within the same job may not fail –  Job Failure: A job fails, all of its tasks fail 15
  • 16. System Architecture §  Submit Host –  Job Retry –  Reclustering §  Execution Site –  Failure Generator –  Failure Monitor 16
  • 17. Application: Fault Tolerant Clustering §  Task clustering can reduce execution overhead §  A job composed of multiple tasks may have a greater risk of suffering from failures §  Reclustering and Job Retry are proposed –  No Optimization (NOOP) retries the failed jobs. –  Dynamic Clustering (DC) decreases the clusters.size if the measured job failure rate is high. 17
  • 18. Application: Fault Tolerant Clustering –  Selective Reclustering (SR) retries the failed tasks in a job –  Dynamic Reclustering (DR) retires the failed tasks in a job and also decreases the clusters.size if the measured job failure rate is high. 18
  • 19. Performance §  Reclustering (DR/DC/SR) reduces the influence of failures significantly compared to NOOP §  DR outperforms other techniques. The lower the better 19
  • 20. Conclusion §  WorkflowSim assists researchers to evaluate their workflow optimization techniques with better accuracy and wider support. –  Modeling Overhead and Failures –  Distributed and Hierarchical components –  Workflow Techniques §  It is necessary to consider both data dependencies, workflow failures and system overhead. 20
  • 21. Acknowledgements §  Pegasus Team: http://pegasus.isi.edu §  FutureGrid & XSEDE §  Funded by NSF grants IIS-0905032. §  More info: wchen@isi.edu §  Available on request §  Q&A 21
  • 22. Overhead Distribution Montage 22
  • 23. Overhead Distribution CyberShake 23
  • 24. Broadband 24
  • 25. Task Failure Model and Job Failure Model n 4d k* = −d + d 2 − r ln(1− α ) k* = , if n >> r 2t * (kt + d) ttotal = * n(k *t + d) 1− β t = total * rk(1− α )k 25 25
  • 26. Related Papers §  Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012. §  Improving Scientific Workflow Performance using Policy Based Data Placement, IEEE Policy 2012. §  Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012 §  Workflow Overhead Analysis and Optimizations, WORKS 2011. §  Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints, PPAM 2011 §  Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems, Scientific Programming Journal 2005 26